Elasticsearch dense_vector is taking up too much storage space! Help

I'm having an issue with Elasticsearch where my dense_vector data is taking up significantly more storage space than expected.

ES Version
8.2.3

Data Model:

  • vector field: Type is dense_vector with a dimension of 1024. The values are stored as 32-bit floating-point numbers, falling within the range of -0.1 to 0.1. Theoretically, each vector should occupy 4 KB.
  • Indexed 1,000 documents, which should result in a total storage of about 4 MB.
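
In other words, the expectation is roughly:

1024 dims × 4 bytes per float = 4,096 bytes ≈ 4 KB per vector
1,000 docs × ~4 KB per vector ≈ 4 MB of raw vector data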

Actual Storage:

  • Actual storage usage is 12MB.
  • The _source field is taking up 8MB, and I'm unsure why.

Index mapping:

{
  "test_vector_v1": {
    "mappings": {
      "properties": {
        "vector": {
          "type": "dense_vector",
          "dims": 1024
        }
      }
    }
  }
}

Disk storage:
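
For reference, the breakdown below appears to come from the analyze index disk usage API, i.e. something along these lines:

POST /test_vector_v1/_disk_usage?run_expensive_tasks=true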

{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "test_vector_v1": {
        "store_size": "12mb",
        "store_size_in_bytes": 12649263,
        "all_fields": {
            "total": "12mb",
            "total_in_bytes": 12642767,
            "inverted_index": {
                "total": "16.9kb",
                "total_in_bytes": 17348
            },
            "stored_fields": "8.1mb",
            "stored_fields_in_bytes": 8522858,
            "doc_values": "3.9mb",
            "doc_values_in_bytes": 4101499,
            "points": "1kb",
            "points_in_bytes": 1062,
            "norms": "0b",
            "norms_in_bytes": 0,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0
        },
        "fields": {
            "_id": {
                "total": "35kb",
            },
            "_primary_term": {
                "total": "0b",
            },
            "_seq_no": {
                "total": "2.5kb",
            },
            "_source": {
                "total": "8.1mb",
                "total_in_bytes": 8504327,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "8.1mb",
                "stored_fields_in_bytes": 8504327,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "vector": {
                "total": "3.9mb",
                "total_in_bytes": 4100000,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "3.9mb",
                "doc_values_in_bytes": 4100000,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            }
        }
    }
}

Optimization Attempts:

  • Disabled doc_values: Not supported for dense_vector.
  • Excluded the vector field from _source, and also tried disabling _source entirely: either way a _recovery_source field is created, which also takes up ~8MB. The mapping for the exclude attempt:
PUT /test_vector_v2
{
  "mappings": {
    "properties": {
      "vector": { "type": "dense_vector", "dims": 1024 },
      "file_id": { "type": "keyword" }
    },
    "_source": { "includes": [ "file_id" ] }
  }
}
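
For completeness, the "disable _source entirely" variant mentioned above would look roughly like this (the index name here is made up); on 8.2 this still leaves a _recovery_source stored field behind:

PUT /test_vector_v3
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "vector": { "type": "dense_vector", "dims": 1024 },
      "file_id": { "type": "keyword" }
    }
  }
}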

I'm looking for ways to reduce the storage overhead of my dense_vector data in Elasticsearch. Is it feasible to get down to roughly the theoretical 4 MB?
Thanks in advance for any suggestions!

Upgrade! Upgrade! Upgrade.

We added a LOT of improvements since 8.2. 8.15.0 is your friend :wink:

1 Like

Thank you for your answer. Besides upgrading, what other approaches can we take to optimize storage?

Hi @Andy_Cong

Question

Are you trying to figure out what the overall storage will be at scale?

... Are you trying to extrapolate, say, to a million of those? Is that what you're trying to figure out?

Because at small scales Elasticsearch isn't as efficient. The way the data is laid out in segments on disk can take up more room than the actual data needs... but at scale the efficiency gets much better.
Very small numbers of documents often provide a very poor estimate of what the actual storage will be at scale... which seems to be what's happening here.

So why don't you put in 100,000 or 1M of them and then run a force merge on the index down to one segment? Then you'll have a much better understanding of the actual disk space required.
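
A minimal sketch of that workflow, reusing the index name from above (the force merge may take a while with 1M vectors):

POST /test_vector_v1/_forcemerge?max_num_segments=1

POST /test_vector_v1/_disk_usage?run_expensive_tasks=true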

That would be my suggestion...

1 Like

Agreed, but on an 8.15 version. So many changes have happened in this space that it's pointless to draw conclusions from an old version, IMO.

Read:

6 Likes

@dadoonet Totally agree!!!..

Plus quantization, etc... Then I would still load 1M and force merge to get the true size.
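
For example, on a recent 8.x release (8.12+) the mapping can opt into scalar quantization via an index_options type of int8_hnsw, roughly like this (index name made up; check the docs for your exact version):

PUT /test_vector_quantized
{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "int8_hnsw" }
      }
    }
  }
}

Note that int8_hnsw keeps the original float vectors alongside the quantized copy, so it mainly reduces search-time memory rather than raw disk usage.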

1 Like

Thank you very much. Is there any stress-test / performance comparison between these versions?

Nightly benchmarks

https://elasticsearch-benchmarks.elastic.co/#tracks/dense_vector/nightly/default/90d