Elasticsearch dense_vector is taking up too much storage space! Help

I'm having an issue with Elasticsearch where my dense_vector data is taking up significantly more storage space than expected.

ES Version:
8.2.3

Data Model:

  • vector field: dense_vector with 1024 dimensions. Values are 32-bit floats in the range -0.1 to 0.1, so each vector should theoretically occupy 4KB (1024 × 4 bytes).
  • Indexed 1000 documents, which should come to roughly 4MB of vector data in total.

Actual Storage:

  • Actual storage usage is 12MB.
  • The _source field is taking up 8MB, and I'm unsure why.

Index Mapping:

{
  "test_vector_v1": {
    "mappings": {
      "properties": {
        "vector": {
          "type": "dense_vector",
          "dims": 1024
        }
      }
    }
  }
}

Disk Storage:

{
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "test_vector_v1": {
        "store_size": "12mb",
        "store_size_in_bytes": 12649263,
        "all_fields": {
            "total": "12mb",
            "total_in_bytes": 12642767,
            "inverted_index": {
                "total": "16.9kb",
                "total_in_bytes": 17348
            },
            "stored_fields": "8.1mb",
            "stored_fields_in_bytes": 8522858,
            "doc_values": "3.9mb",
            "doc_values_in_bytes": 4101499,
            "points": "1kb",
            "points_in_bytes": 1062,
            "norms": "0b",
            "norms_in_bytes": 0,
            "term_vectors": "0b",
            "term_vectors_in_bytes": 0
        },
        "fields": {
            "_id": {
                "total": "35kb",
            },
            "_primary_term": {
                "total": "0b",
            },
            "_seq_no": {
                "total": "2.5kb",
            },
            "_source": {
                "total": "8.1mb",
                "total_in_bytes": 8504327,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "8.1mb",
                "stored_fields_in_bytes": 8504327,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            },
            "vector": {
                "total": "3.9mb",
                "total_in_bytes": 4100000,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "0b",
                "stored_fields_in_bytes": 0,
                "doc_values": "3.9mb",
                "doc_values_in_bytes": 4100000,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0
            }
        }
    }
}
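
For anyone who wants to reproduce this: output in this shape comes from the analyze index disk usage API (available since 7.15); the request is along these lines:

POST /test_vector_v1/_disk_usage?run_expensive_tasks=true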

Optimization Attempts:

  • Disabled doc_values: not supported for dense_vector.
  • Removed the vector field from _source (and also tried disabling _source entirely): either way Elasticsearch adds a _recovery_source field, which also takes up about 8MB. The mapping I used:

PUT /test_vector_v2
{
  "mappings": {
    "properties": {
      "vector": { "type": "dense_vector", "dims": 1024 },
      "file_id": { "type": "keyword" }
    },
    "_source": { "includes": [ "file_id" ] }
  }
}
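
An equivalent way to write that, just as a sketch, is to exclude the vector field explicitly (test_vector_v2_alt is a placeholder name):

PUT /test_vector_v2_alt
{
  "mappings": {
    "_source": { "excludes": [ "vector" ] },
    "properties": {
      "vector": { "type": "dense_vector", "dims": 1024 },
      "file_id": { "type": "keyword" }
    }
  }
}

From what I've read, _recovery_source should eventually be pruned during merges once it's no longer needed for recovery, but it still looks like a lot of overhead in the meantime.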

I'm looking for ways to reduce the storage overhead of my dense_vector data in Elasticsearch. Is it feasible to get the total storage close to the theoretical 4MB?
Thanks in advance for any suggestions!

Upgrade! Upgrade! Upgrade.

We've added a LOT of improvements since 8.2. 8.15.0 is your friend :wink:


Thank you for your answer. Besides upgrading, what other approaches can we take to optimize the storage?

Hi @Andy_Cong

Question

Are you trying to figure out what the overall storage will be at scale?

... Are you trying to extrapolate to, say, a million of those documents? Is that what you're trying to figure out?

Because at small scales Elasticsearch isn't as efficient: the way the data is laid out in segments on disk can take up more room than the actual data needs... but at scale the efficiency gets much better.
Very small numbers of documents often give a very poor estimate of what the actual storage will be at scale... which seems to be what you're running into here.

So why don't you index 100,000 or 1M of them, then force merge the index down to one segment? Then you'll have a much better understanding of the actual disk space required.

That would be my suggestion...
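
Something like this once the data is loaded (adjust the index name to yours):

POST /test_vector_v1/_forcemerge?max_num_segments=1

and then re-run the disk usage request to see the post-merge numbers.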


Agreed, but with an 8.15 version. So many changes have happened in this space that it's useless to draw conclusions on an old version, IMO.

Read:


@dadoonet Totally agree!!!..

Plus quantization etc. etc... Then I would still load 1M and force merge to get the true size.
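
For example, on 8.12+ a scalar-quantized dense_vector mapping looks roughly like this (test_vector_v3 is just an example name; as far as I know the int8 copy mainly reduces the memory needed for HNSW search, while the raw float vectors are still kept on disk):

PUT /test_vector_v3
{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "cosine",
        "index_options": { "type": "int8_hnsw" }
      }
    }
  }
}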


Thank you very much. Is there any stress test performance comparison between these versions?

Nightly benchmarks

https://elasticsearch-benchmarks.elastic.co/#tracks/dense_vector/nightly/default/90d


Thank you. Very nice. Does this chart show the ES version? How can I see which version corresponds to each result?