Understanding Storage Overhead in Elasticsearch for Vector Data

Problem Description:
Below is my Elasticsearch index mapping. In theory, a 1024-dimension vector of float32 values takes 4KB of storage (1024 × 4 bytes), so each document record should occupy approximately 4KB.
However, after writing 7.5 million records, the total storage used is 90GB: 60GB for stored fields and 30GB for doc_values, which is about 1/3 of the storage. Each record occupies 12.58KB.
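
For reference, the arithmetic behind these figures can be sketched as follows. This is only a rough check: it assumes the document count is exactly 7,500,000 (the post says "7.5 million") and uses the store_size_in_bytes value from the stats output in this post.

```python
# Rough check of the expected vs. observed per-document size.
DIMS = 1024
BYTES_PER_FLOAT32 = 4
DOC_COUNT = 7_500_000               # assumption: "7.5 million" taken as exact
STORE_SIZE_BYTES = 96_627_326_548   # store_size_in_bytes from the stats output

# Raw size of one float32 vector.
raw_vector_bytes = DIMS * BYTES_PER_FLOAT32

# Observed on-disk size per document.
observed_bytes_per_doc = STORE_SIZE_BYTES / DOC_COUNT

print(raw_vector_bytes)                                     # 4096
print(round(observed_bytes_per_doc / 1024, 2))              # 12.58
print(round(observed_bytes_per_doc / raw_vector_bytes, 2))  # 3.15
```

So each record takes a little over three times the raw size of its vector.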

  • chunk mapping:
PUT /chunk_vector
{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "chunk_id": {
                "type": "long"
            },
            "file_id": {
                "type": "keyword"
            },
            "file_name": {
                "type": "keyword",
                "doc_values": false
            },
            "group_id": {
                "type": "keyword"
            },
            "vector": {
                "type": "dense_vector",
                "dims": 1024
            },
            "ctime": {
                "type": "long"
            },
            "mtime": {
                "type": "long"
            }
        }
    }
}
  • chunk storage:
   "chunk_vector": {
        "store_size": "89.9gb",
        "store_size_in_bytes": 96627326548,
        "all_fields": {
            "total": "89.9gb",
            "total_in_bytes": 96616074283,
            "inverted_index": {
                "total": "53.5mb",
                "total_in_bytes": 56200185
            },
            "stored_fields": "60.1gb",
            "stored_fields_in_bytes": 64585932961,
            "doc_values": "29.7gb",
            "doc_values_in_bytes": 31897294917,
            "points": "73mb",
            "points_in_bytes": 76646220,
            "norms": "0b",
            "norms_in_bytes": 0,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0
        },

I would like to optimize the storage of vectors. Please provide some suggestions and help resolve my queries. Thank you.
Questions:

  1. For vector search, is it necessary to use the doc_values feature?
  2. Why does each record occupy 12.58KB?
  3. How can we optimize vector storage space while keeping the vector dimension at 1024?

Note: Elasticsearch version: 8.2.3

  1. That's a really old version of Elasticsearch. There have been a lot of improvements, especially around dense_vector, so I'd strongly recommend upgrading to a more recent version.
  2. You'll have the dense_vector both in the indexed data structure (HNSW) and in _source. So you could exclude it from _source, but one major downside of that approach is that you won't be able to reindex the data any more. Or you could enable synthetic source, but that is not a GA feature for non-TSDB indices. See "Tune approximate kNN search" in the Elasticsearch Guide [8.14] for recommendations and tradeoffs.
  3. You have more fields, so each document will be larger than just the dense_vector. But _disk_usage should give you a pretty good rundown of where the space is used. My guess would be that synthetic source or excluding the dense_vector from _source would be the biggest improvement you can make here. Every other tuning (like compression or tuning field mappings) will only give you smaller improvements and comes with its own tradeoffs.
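
As a minimal sketch of the "exclude it from _source" option, the mapping from the question could be extended with the `_source.excludes` mapping setting. The helper name `mapping_without_vector_in_source` is hypothetical and only builds the request body. The downside mentioned above still applies: the vector is no longer in _source, so it cannot be reindexed from this index.

```python
import json


def mapping_without_vector_in_source(dims: int = 1024) -> dict:
    """Build an index-creation body that keeps `vector` out of _source.

    Hypothetical helper for illustration; the field names are copied from
    the mapping in the question.
    """
    return {
        "mappings": {
            "dynamic": "strict",
            # The vector is still stored in the field's own data structure,
            # but no longer duplicated inside the stored _source JSON.
            "_source": {"excludes": ["vector"]},
            "properties": {
                "chunk_id": {"type": "long"},
                "file_id": {"type": "keyword"},
                "file_name": {"type": "keyword", "doc_values": False},
                "group_id": {"type": "keyword"},
                "vector": {"type": "dense_vector", "dims": dims},
                "ctime": {"type": "long"},
                "mtime": {"type": "long"},
            },
        }
    }


body = mapping_without_vector_in_source()
# Body to send when creating a new index (PUT /<index-name>).
print(json.dumps(body, indent=4))
```

Existing documents are not rewritten by a mapping change like this, so in practice the data would have to be indexed again into a new index from the original source of the vectors.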

First, thank you for your answer. I still have a question.

  1. I'll try it later.
  2. Good idea, I'll look into it in more detail later.
  3. I used the _disk_usage query and found that the _source field occupies 60GB, including doc_values. Additionally, why does the vector field separately occupy another 30GB?
{
    "_shards": {
        "total": 90,
        "successful": 90,
        "failed": 0
    },
    "chunk_vector": {
        "store_size": "91.4gb",
        "store_size_in_bytes": 98165022949,
        "all_fields": {
            "total": "91.4gb",
            "total_in_bytes": 98153614161,
            "inverted_index": {
                "total": "54.7mb",
                "total_in_bytes": 57380635
            },
            "stored_fields": "61.1gb",
            "stored_fields_in_bytes": 65610336099,
            "doc_values": "30.1gb",
            "doc_values_in_bytes": 32407647278,
            "points": "74.6mb",
            "points_in_bytes": 78250149,
            "norms": "0b",
            "norms_in_bytes": 0,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0
        },
        "fields": {
            "__soft_deletes": {
                "total": "282.8kb",
                "total_in_bytes": 289676,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "282.8kb",
                "doc_values_in_bytes": 289676,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_field_names": {
                "total": "1mb",
                "total_in_bytes": 1050826,
                "inverted_index": {
                    "total": "1mb",
                    "total_in_bytes": 1050826
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_id": {
                "total": "334.6mb",
                "total_in_bytes": 350866540,
                "inverted_index": {
                    "total": "39.4mb",
                    "total_in_bytes": 41361917
                },
                "stored_fields": "295.1mb",
                "stored_fields_in_bytes": 309504623,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_primary_term": {
                "total": "0b",
                "total_in_bytes": 0,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_routing": {
                "total": "174mb",
                "total_in_bytes": 182494447,
                "inverted_index": {
                    "total": "1.6mb",
                    "total_in_bytes": 1775801
                },
                "stored_fields": "172.3mb",
                "stored_fields_in_bytes": 180718646,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_seq_no": {
                "total": "34.4mb",
                "total_in_bytes": 36087956,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "15.7mb",
                "doc_values_in_bytes": 16466033,
                "points": "18.7mb",
                "points_in_bytes": 19621923,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_source": {
                "total": "60.6gb",
                "total_in_bytes": 65120112830,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "60.6gb",
                "stored_fields_in_bytes": 65120112830,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_tombstone": {
                "total": "1kb",
                "total_in_bytes": 1046,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "1kb",
                "doc_values_in_bytes": 1046,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "_version": {
                "total": "96kb",
                "total_in_bytes": 98353,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "96kb",
                "doc_values_in_bytes": 98353,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "chunk_id": {
                "total": "38.6mb",
                "total_in_bytes": 40555844,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "12.9mb",
                "doc_values_in_bytes": 13570698,
                "points": "25.7mb",
                "points_in_bytes": 26985146,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "ctime": {
                "total": "26.3mb",
                "total_in_bytes": 27680959,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "11.3mb",
                "doc_values_in_bytes": 11859476,
                "points": "15mb",
                "points_in_bytes": 15821483,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "file_id": {
                "total": "13mb",
                "total_in_bytes": 13662392,
                "inverted_index": {
                    "total": "4.3mb",
                    "total_in_bytes": 4588082
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "8.6mb",
                "doc_values_in_bytes": 9074310,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "file_name": {
                "total": "6.5mb",
                "total_in_bytes": 6827886,
                "inverted_index": {
                    "total": "6.5mb",
                    "total_in_bytes": 6827886
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "group_id": {
                "total": "6.8mb",
                "total_in_bytes": 7155203,
                "inverted_index": {
                    "total": "1.6mb",
                    "total_in_bytes": 1776123
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "5.1mb",
                "doc_values_in_bytes": 5379080,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "mtime": {
                "total": "26.3mb",
                "total_in_bytes": 27681073,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "11.3mb",
                "doc_values_in_bytes": 11859476,
                "points": "15mb",
                "points_in_bytes": 15821597,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "vector": {
                "total": "30.1gb",
                "total_in_bytes": 32339049130,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "30.1gb",
                "doc_values_in_bytes": 32339049130,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            }
        }
    }
}

For point 3: You have the full document in _source (by default). On top of that, the fields also need to be stored in their indexed data structures. That's why you could either exclude the large dense_vector field from _source or use synthetic source, though both have their tradeoffs.
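
To make the split concrete, here is a small sketch that ranks the largest fields from the _disk_usage output above and converts them to bytes per document. The byte counts are copied from that output. The document count of 7,500,000 is an assumption based on the "7.5 million" in the question, so the per-document figures are approximate.

```python
DOC_COUNT = 7_500_000  # assumption: "7.5 million" taken as exact

# total_in_bytes per field, copied from the _disk_usage output above
# (only the largest fields are listed here).
fields_total_in_bytes = {
    "_source": 65_120_112_830,
    "vector": 32_339_049_130,
    "_id": 350_866_540,
    "_routing": 182_494_447,
    "chunk_id": 40_555_844,
}
ALL_FIELDS_TOTAL = 98_153_614_161  # all_fields.total_in_bytes


def per_doc_breakdown(fields, total, doc_count):
    """Return (field, bytes per doc, percent of total) rows, largest first."""
    ranked = sorted(fields.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, round(size / doc_count), round(100 * size / total, 1))
        for name, size in ranked
    ]


rows = per_doc_breakdown(fields_total_in_bytes, ALL_FIELDS_TOTAL, DOC_COUNT)
for name, bytes_per_doc, percent in rows:
    print(f"{name:10s} {bytes_per_doc:6d} B/doc {percent:5.1f}%")
```

Under these assumptions, the vector field comes out at roughly 4.3KB per document, close to the 4KB expected for 1024 float32 values, while _source accounts for roughly 8.7KB per document, about two thirds of the index.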