Performance and storage of the dense_vector type

While testing the dense_vector type for 'indexing' of deep neural network embeddings, I came across a very confusing outcome.

I have been testing the impact and performance of the following part of the mapping:

{ ...
   'embedding_raw': {'type': 'dense_vector', 'dims': 512},
}
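For context, here is a minimal sketch of the full index mapping this fragment would sit in (the surrounding structure and the absence of other fields are assumptions, not the original setup):

```python
# Hypothetical complete mapping body containing the fragment above.
mapping = {
    "mappings": {
        "properties": {
            "embedding_raw": {"type": "dense_vector", "dims": 512},
        }
    }
}

print(mapping["mappings"]["properties"]["embedding_raw"])
```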

According to the documentation:

Internally, each document’s dense vector is encoded as a binary doc value. Its size in bytes is equal to 4 * dims + 4, where dims is the number of the vector’s dimensions.

So internally that would mean each vector is stored as a packed list of float32 values, which works out to about 2 kB per vector.
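That expectation is easy to check arithmetically; a quick sketch of the uncompressed sizing:

```python
dims = 512
docs = 10_000

# Four bytes per float32 dimension, plus a 4-byte header per vector.
bytes_per_vector = 4 * dims + 4
print(bytes_per_vector)  # 2052 bytes, i.e. roughly 2 kB

total_bytes = bytes_per_vector * docs
print(total_bytes)       # 20520000, i.e. about 20.5 MB for 10k documents
```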

Then I proceeded to test this with 10k sample documents. Vector values were real data (not randomly generated) yet random enough for the machine, zero-centered, mostly -1 < x < 1.
With that you would expect the index to take about 20MB; however, it took more than 100MB.

So I started inserting various values for the vectors, all vectors of 512 elements, noting the index size after ingestion:

  • [] (nothing, control) by removing the dense_vector field and only adding other metadata: 1.5MB
  • [0 for range(512)] : 5.6MB
  • [0.5 for range(512)]: 5.6MB
    Then I thought maybe vectors are stored plain-text, not as binary data
  • [e/pi for range(512)]: 5.7MB
    Doesn't seem like, so I started to feed actual vector data
  • [round(x,1) for x in data]: 24.6MB
  • [round(x,2) for x in data]: 31.6MB
  • [round(x,4) for x in data]: 42MB
  • [float(round(4*x))/4.0 for x in data]: 6MB (mostly zeros)
  • [sign(x) for x in data]: 18MB
  • [x for x in data]: 102MB

What confused me is that documents indexed this way take significantly more space when you provide more precision. If they are really stored as binary data, then the rest must be spatial indices to speed up brute-force nearest-neighbour search; yet according to the documentation, the only use for dense vectors is rescoring with script_score, which by definition cannot use indices.
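For reference, the brute-force scoring the documentation describes looks roughly like this (query body sketched as a Python dict; the 7.x `cosineSimilarity` syntax and the field name are assumptions carried over from the mapping above):

```python
# Sketch of a script_score query for brute-force vector search.
# Assumption: Elasticsearch 7.x script syntax; 'embedding_raw' is the
# dense_vector field from the mapping in the question.
query_vector = [0.0] * 512  # placeholder query embedding

search_body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},  # scored against every doc: brute force
            "script": {
                # cosineSimilarity ranges over -1..1; +1.0 keeps scores non-negative
                "source": "cosineSimilarity(params.query_vector, 'embedding_raw') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
}

print(list(search_body["query"]))
```

No index structure is consulted here: every matching document's vector is read and scored, which is why this cannot benefit from any spatial index.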

Is there any 'clever' spatial indexing going on in the background for search? How do I use it with scoring? Or maybe, unlike what the documentation says, documents are just stored as plain text and dense_vector is just a waste of space?
According to this:

There is work being done on this topic as we speak, and if dense vectors are indeed building some sort of spatial index, that would explain the increasing space needed for the documents; yet the documentation still only provides examples of brute-force search using scripts.

Any insights about that would be greatly appreciated.

Anyone had some experience with dense vectors and their behaviour?

@Jacek_Wolkowicz Sorry for a late reply, I hope it could still be useful.

We do store a dense_vector as a binary doc value with size 4*dims + 4, but this is the size before compression. When we store doc values on disk, we compress them, and the compressed size depends on the actual data (repeated data can be compressed much better).
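The compression effect is easy to reproduce outside Elasticsearch. A small sketch (with zlib standing in for whatever codec Lucene actually applies, which is an assumption) shows why rounded or constant vectors take much less space than full-precision ones:

```python
import random
import struct
import zlib

dims = 512
random.seed(0)

def encode(vec):
    # Pack as little-endian float32 values, mimicking (as an assumption)
    # the uncompressed binary doc-value layout described in the docs.
    return struct.pack(f"<{len(vec)}f", *vec)

full = [random.uniform(-1.0, 1.0) for _ in range(dims)]
rounded = [round(x, 1) for x in full]   # only ~21 distinct byte patterns
zeros = [0.0] * dims                    # maximally repetitive

for name, vec in [("zeros", zeros), ("rounded", rounded), ("full", full)]:
    raw = encode(vec)
    print(name, len(raw), "->", len(zlib.compress(raw)))
```

The raw size is identical (2048 bytes) in all three cases; only the compressed size varies, mirroring the index sizes reported in the question.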

We've removed this statement about size from our documentation in later versions so as not to confuse users.