Dense vectors taking up much more space than expected

Hi,

We are going to be storing embeddings for upwards of hundreds of millions of documents, so every bit of storage counts (things get quite expensive at this scale).

We are using a 1024-dimension dense_vector field, and I saw that if we keep the vectors in the _source field, the storage size blows up to a ridiculous degree.

For example, I have an index I was testing with approximately 300k documents. That many documents should result in somewhere around 1.2 GB of storage for embeddings (roughly 300,000 docs × 1024 dims × 4 bytes per float).

Instead, the size of the index grew by around 5 GB when I added embeddings to all of those documents.

I did some digging and ran:
POST {index}/_disk_usage?run_expensive_tasks=true

This gave me exactly the expected result for the size of the dense vectors:

"textVectors1024.vector": { -
        "total": "1.2gb",
        "total_in_bytes": 1353195311,
        "inverted_index": { -
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "1.2gb",
        "knn_vectors_in_bytes": 1353195311
      }

That's about 1.2 GB of storage, right in line with expectations.

When I looked at the _source field, however, it grew from around 2 GB (before embeddings) to 6 GB (after embeddings)!

"_source": {
                "total": "6gb",
                "total_in_bytes": 6514271224,
                "inverted_index": {
                    "total": "0b",
                    "total_in_bytes": 0
                },
                "stored_fields": "6gb",
                "stored_fields_in_bytes": 6514271224,
                "doc_values": "0b",
                "doc_values_in_bytes": 0,
                "points": "0b",
                "points_in_bytes": 0,
                "norms": "0b",
                "norms_in_bytes": 0,
                "term_vectors": "0b",
                "term_vectors_in_bytes": 0,
                "knn_vectors": "0b",
                "knn_vectors_in_bytes": 0
            },

That is untenable for our scale and purposes; using nearly 4x the expected storage space just isn't workable.

As a workaround, we decided to use the _source excludes functionality to remove our vectors from _source (rough sketch below the list of issues).

This solves the space issue, and the index only used the expected 1.2 GB. However, it opens us up to other issues:

  1. Re-indexing is a pain now. We need to regenerate embeddings every time we re-index.
  2. I just discovered update_by_query also removes the embeddings, which is a much more common operation for us than re-indexing.
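
For reference, the exclusion looks roughly like this in our mappings (the index name is a placeholder and the rest of the mapping is omitted; this is just a sketch of the _source excludes setting, not our full mapping):

PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [
        "textVectors1024.vector"
      ]
    }
  }
}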

The mapping of the dense vector field is nested, as some documents need multiple embeddings:

"textVectors1024": { - 
          "type": "nested",
          "properties": { - 
            "vector": { - 
              "type": "dense_vector",
              "dims": 1024,
              "index": true,
              "similarity": "cosine"
            }
          }
        },

We have experimented with all permutations of 'index' and 'store' set to true/false in the mapping. None of them had any bearing whatsoever on the storage size; only excluding the vectors from _source got us back to sane storage usage.

Is there anything at all that we can do other than deal with the headache of needing to regenerate embeddings any time we need to update the data?

We are on version 8.12.2.

I perused the release notes for newer versions, and nothing jumped out at me that would fix this, but I would love to be wrong about that.

Thanks!

@Justin_Porter The numbers you are reporting line up with roughly what I would expect. Keeping the vectors in _source incurs the overhead of storing all of that data, which at a minimum duplicates all of your dense vectors.

And you've done the legwork here to show that a lot of the overhead you are experiencing comes from _source.

This is partly just the current state of what we can do in Elasticsearch: the data stored for efficient vector comparison winds up being organized very differently from the data stored for _source and the metadata around it, hence the duplication.
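
For intuition: with vectors included, every document's stored _source carries the full float array as JSON text, something like the snippet below (made-up values, truncated to 3 of 1024 dimensions, field names other than yours are placeholders). A 1024-float array serialized as JSON text is generally larger than the 4 bytes per dimension the kNN structures use, so even after stored-field compression the _source growth can exceed the raw vector size, which lines up with what you're seeing.

{
  "title": "example doc",
  "textVectors1024": [
    {
      "vector": [0.0182, -0.0457, 0.0031]
    }
  ]
}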

A couple of things stand out to me here.

One is that we have a newer feature, synthetic _source, which is not GA yet and may become a paid feature in the future, but which would likely solve your problem here. It reconstructs the source on demand for operations like reindexing, without the need to store it: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source

Here's the example of using it from the docs:

PUT idx
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    }
  }
}
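
Applied to your mapping, it would look something like the sketch below. I'd double-check against the 8.12 docs that synthetic source covers nested objects and dense_vector on your version, since field-type support has shifted across releases (the index name here is just a placeholder):

PUT my-index
{
  "mappings": {
    "_source": {
      "mode": "synthetic"
    },
    "properties": {
      "textVectors1024": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "cosine"
          }
        }
      }
    }
  }
}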

The feature is available in 8.12.x; it was introduced in 8.4. Just by browsing the list of closed bugs, I can tell you that a variety of things have been fixed since 8.12.x (up to the latest 8.15.x), but it's not clear how many of those would actually bite you, particularly for an evaluation. So it should be fine to try out on your current version.

My gut reaction is to give it a shot, see if it meets your needs, and then debate whether it's worthwhile to adopt a non-GA feature (albeit one that is fairly mature at this point). If it helps sway you at all, we've started using synthetic source by default in other parts of the stack (also tech preview, but we nonetheless feel confident in it).

Barring that, I think you are correct: you'd need to store your embeddings separately or regenerate them as needed for reindexing.

One other thing that stood out to me, since you didn't mention it: from a cost standpoint, I would expect RAM to be an important part of your evaluation alongside storage, and we can discuss that further as well if it would help. A lot of the time RAM winds up being a bigger bottleneck than disk, since you need enough memory to hold the vector search graph (although we have some cool stuff in the works related to that too). Suffice it to say, I'd be curious how that's going for you.

Happy to help further here and go back and forth and dig in a bit if it would be useful. In particular, if you want to walk through your specific deployment / hardware setup, I might be able to talk through options there too.

@john-wagster Thank you very much for your detailed reply!

I will experiment with synthetic source and see if any issues arise from it.

As to RAM, so far we haven't had too many issues.
We use hybrid vector searches, so the documents are filtered down to a much smaller subset with basic Elasticsearch queries.
In the absolute worst case we are looking at a couple of million documents; usually it is far less.
At the extremes, we limit how many documents we actually run the vector search over with

{
  "script": {
    "source": "Math.random() < params.percentage",
    "params": {
      "percentage": percentage
    }
  }
}

and we can tune that percentage to a value our cluster can handle.
We're aiming to get the count of documents that are relevant to a given piece of text, so we can extrapolate the actual count from the percentage. (Obviously an approximation, but good enough for our purposes.)
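
To make that concrete, the script ends up as a random-sampling filter in our search request, conceptually something like the sketch below (the index name, match clause, and field name are placeholders, and the vector part of the query is omitted; this is just the shape, not our actual query):

POST my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "text": "some query text" } }
      ],
      "filter": [
        {
          "script": {
            "script": {
              "source": "Math.random() < params.percentage",
              "params": {
                "percentage": 0.1
              }
            }
          }
        }
      ]
    }
  }
}

With percentage at 0.1, if the filtered search matches, say, 1,200 documents, we extrapolate roughly 12,000 relevant documents.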

I'll let you know how the synthetic source experiments go.

Thanks a lot!