I'm having an issue with Elasticsearch where my dense_vector data is taking up significantly more storage space than expected.
ES Version: 8.2.3
Data Model:
- The vector field is of type dense_vector with 1024 dimensions. Values are stored as 32-bit floats in the range -0.1 to 0.1.
- In theory, each vector should occupy 4KB (1024 × 4 bytes).
- I indexed 1000 documents, which should result in about 4MB of total storage.
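To make the estimate above explicit, here is a minimal Python sketch of the arithmetic. It only assumes the numbers already stated (1024 dimensions, 4 bytes per float, 1000 documents) and ignores any per-document or per-segment overhead Elasticsearch may add.

```python
# Back-of-the-envelope estimate of raw vector storage.
# Assumption: only the float payload is counted, no index overhead.
DIMS = 1024           # "dims" from the mapping
BYTES_PER_FLOAT = 4   # 32-bit floating-point value
NUM_DOCS = 1000       # number of indexed documents

bytes_per_vector = DIMS * BYTES_PER_FLOAT
total_bytes = bytes_per_vector * NUM_DOCS

print(f"per vector: {bytes_per_vector} bytes")
print(f"total: {total_bytes} bytes ({total_bytes / 1024**2:.2f} MiB)")
```

This gives 4096 bytes per vector and 4,096,000 bytes in total, which is the "4MB" figure I am using as the target.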
Actual Storage:
- Actual storage usage is 12MB.
- The _source field is taking up 8MB, and I'm unsure why.
Index mapping:
{
"test_vector_v1": {
"mappings": {
"properties": {
"vector": {
"type": "dense_vector",
"dims": 1024
}
}
}
}
}
Disk storage:
{
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"test_vector_v1": {
"store_size": "12mb",
"store_size_in_bytes": 12649263,
"all_fields": {
"total": "12mb",
"total_in_bytes": 12642767,
"inverted_index": {
"total": "16.9kb",
"total_in_bytes": 17348
},
"stored_fields": "8.1mb",
"stored_fields_in_bytes": 8522858,
"doc_values": "3.9mb",
"doc_values_in_bytes": 4101499,
"points": "1kb",
"points_in_bytes": 1062,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0
},
"fields": {
"_id": {
"total": "35kb"
},
"_primary_term": {
"total": "0b"
},
"_seq_no": {
"total": "2.5kb"
},
"_source": {
"total": "8.1mb",
"total_in_bytes": 8504327,
"inverted_index": {
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "8.1mb",
"stored_fields_in_bytes": 8504327,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0
},
"vector": {
"total": "3.9mb",
"total_in_bytes": 4100000,
"inverted_index": {
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "0b",
"stored_fields_in_bytes": 0,
"doc_values": "3.9mb",
"doc_values_in_bytes": 4100000,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0
}
}
}
}
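As a sanity check on these numbers, the following Python sketch breaks the two largest entries down per document and per dimension. The byte counts are copied from the response above; the helper name `per_doc_and_dim` is just something I made up for this calculation.

```python
NUM_DOCS = 1000   # documents indexed
DIMS = 1024       # "dims" from the mapping

# Byte counts copied from the disk usage output above.
reported = {
    "_source (stored_fields)": 8504327,
    "vector (doc_values)": 4100000,
}

def per_doc_and_dim(total_bytes, num_docs=NUM_DOCS, dims=DIMS):
    """Return (bytes per document, bytes per vector dimension)."""
    per_doc = total_bytes / num_docs
    return per_doc, per_doc / dims

for name, total in reported.items():
    per_doc, per_dim = per_doc_and_dim(total)
    print(f"{name}: {per_doc:.1f} bytes/doc, {per_dim:.2f} bytes/dimension")
```

The vector doc_values come to exactly 4100 bytes per document, which is close to the expected 4096. The _source entry works out to roughly 8.3 bytes per dimension on disk, about twice the size of a 32-bit float. That might mean the values in _source are kept in their JSON text form rather than as binary floats, but I have not confirmed this.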
Optimization Attempts:
- Disabled doc_values: Not supported for dense_vector.
- Removed the vector field from _source (mapping below): this creates a new field.
- Disabled _source: this creates a _recovery_source field, which also takes up 8MB.
PUT /test_vector_v2
{
"mappings": {
"properties": {
"vector": { "type": "dense_vector", "dims": 1024 },
"file_id": { "type": "keyword" }
},
"_source": { "includes": [ "file_id" ] }
}
}
I'm looking for ways to reduce the storage overhead of my dense_vector data in Elasticsearch. Is it feasible to get the total storage down to roughly 4MB?
Thank you for any suggestions!