Hi,
We are going to be storing embeddings for hundreds of millions of documents, so every bit of storage counts (things get quite expensive at this scale).
We are using a 1024-dimension dense_vector field, and I noticed that if we keep the vectors in the _source field, the storage size blows up to a ridiculous degree.
For example, I was testing with an index of approximately 300k documents. That number of documents should result in roughly 1.2 GB of storage for the embeddings (300,000 docs × 1024 dims × 4 bytes per float ≈ 1.23 GB).
Instead, the size of the index grew by around 5 GB when I added embeddings to all of those documents.
I did some digging and ran:
POST {index}/_disk_usage?run_expensive_tasks=true
This showed exactly the expected size for the dense vectors:
"textVectors1024.vector": { -
"total": "1.2gb",
"total_in_bytes": 1353195311,
"inverted_index": { -
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "0b",
"stored_fields_in_bytes": 0,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0,
"knn_vectors": "1.2gb",
"knn_vectors_in_bytes": 1353195311
}
So the vectors themselves account for about 1.2 GB of storage, as expected. The _source field, however, grew from around 2 GB (before embeddings) to 6 GB (after embeddings)!
"_source": {
"total": "6gb",
"total_in_bytes": 6514271224,
"inverted_index": {
"total": "0b",
"total_in_bytes": 0
},
"stored_fields": "6gb",
"stored_fields_in_bytes": 6514271224,
"doc_values": "0b",
"doc_values_in_bytes": 0,
"points": "0b",
"points_in_bytes": 0,
"norms": "0b",
"norms_in_bytes": 0,
"term_vectors": "0b",
"term_vectors_in_bytes": 0,
"knn_vectors": "0b",
"knn_vectors_in_bytes": 0
},
That is untenable for our scale and purposes; using nearly 4x the expected storage space is simply unworkable.
As a workaround, we decided to use the _source excludes functionality to remove our vectors from _source (sketched below).
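For reference, a minimal sketch of the mapping change we made; the index name is a placeholder, and the field path mirrors our actual mapping:

PUT my-index
{
  "mappings": {
    "_source": {
      "excludes": [ "textVectors1024.vector" ]
    },
    "properties": {
      "textVectors1024": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "cosine"
          }
        }
      }
    }
  }
}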
This solves the space issue: the index only used the 1.2 GB as expected. However, it opens us up to other issues:
- Re-indexing is a pain now: we need to regenerate the embeddings every time we re-index.
- I just discovered that update_by_query also removes the embeddings (it rebuilds each document from its stored _source, which no longer contains the vectors), and that is a much more common operation for us than re-indexing; see the example after this list.
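To illustrate the update_by_query problem: even an update that never touches the vector field rewrites each matching document from _source, so the excluded embeddings are silently dropped. Something like this (the field names and values here are hypothetical):

POST my-index/_update_by_query
{
  "query": {
    "term": { "status": "stale" }
  },
  "script": {
    "source": "ctx._source.status = 'refreshed'"
  }
}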
The mapping of the dense_vector field is nested, as some documents need multiple embeddings:
"textVectors1024": { -
"type": "nested",
"properties": { -
"vector": { -
"type": "dense_vector",
"dims": 1024,
"index": true,
"similarity": "cosine"
}
}
},
We have experimented with every permutation of 'index' and 'store' set to true/false in the mapping; none of it had any bearing on the storage space. Only excluding the vectors from _source got us back to sane storage usage.
Is there anything at all that we can do other than deal with the headache of needing to regenerate embeddings any time we need to update the data?
We are on version 8.12.2. I perused the release notes for newer versions, and nothing jumped out at me that would fix this, but I would love to be wrong about that.
Thanks!