Performance and storage of the dense_vector type

While testing the dense_vector type for 'indexing' of deep neural network embeddings, I came across a very confusing outcome.

I have been testing the impact and performance of the following part of the mapping:

{ ...
   'embedding_raw': {'type': 'dense_vector', 'dims': 512},
}
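For context, here is a minimal sketch of the full index mapping this fragment would sit in (the surrounding structure and the absence of other fields are assumptions, not the original setup):

```python
# Hypothetical complete mapping body containing the fragment above.
mapping = {
    "mappings": {
        "properties": {
            "embedding_raw": {"type": "dense_vector", "dims": 512},
        }
    }
}

print(mapping["mappings"]["properties"]["embedding_raw"])
```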

According to the documentation:

Internally, each document’s dense vector is encoded as a binary doc value. Its size in bytes is equal to 4 * dims + 4, where dims is the number of the vector’s dimensions.

So internally that would mean each vector is stored as a packed list of float32 values, which works out to about 2 kB per vector.
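That expectation is easy to check arithmetically; a quick sketch of the uncompressed sizing:

```python
dims = 512
docs = 10_000

# Four bytes per float32 dimension, plus a 4-byte header per vector.
bytes_per_vector = 4 * dims + 4
print(bytes_per_vector)  # 2052 bytes, i.e. roughly 2 kB

total_bytes = bytes_per_vector * docs
print(total_bytes)       # 20520000, i.e. about 20.5 MB for 10k documents
```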

Then I proceeded to test this with 10k sample documents. Vector values were real data (not randomly generated) yet random enough for the machine, zero-centered, mostly -1 < x < 1.
With that you would expect the index to take about 20MB; however, it took more than 100MB.

So I started inserting various values for the vectors, all vectors of 512 elements, noting the index size after ingestion:

  • [] (nothing, control) by removing the dense_vector field and only adding other metadata: 1.5MB
  • [0 for range(512)] : 5.6MB
  • [0.5 for range(512)]: 5.6MB
    Then I thought maybe vectors are stored plain-text, not as binary data
  • [e/pi for range(512)]: 5.7MB
    Doesn't seem like, so I started to feed actual vector data
  • [round(x,1) for x in data]: 24.6MB
  • [round(x,2) for x in data]: 31.6MB
  • [round(x,4) for x in data]: 42MB
  • [float(round(4*x))/4.0 for x in data]: 6MB (mostly zeros)
  • [sign(x) for x in data]: 18MB
  • [x for x in data]: 102MB

What confused me is that documents indexed this way take significantly more space when you provide more precision. If they are really stored as binary data, then the rest must be spatial indices to speed up brute-force nearest-neighbour search; yet according to the documentation, the only use for dense vectors is rescoring with script_score, which by definition cannot use indices.
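For reference, the brute-force scoring the documentation describes looks roughly like this (query body sketched as a Python dict; the 7.x `cosineSimilarity` syntax and the field name are assumptions carried over from the mapping above):

```python
# Sketch of a script_score query for brute-force vector search.
# Assumption: Elasticsearch 7.x script syntax; 'embedding_raw' is the
# dense_vector field from the mapping in the question.
query_vector = [0.0] * 512  # placeholder query embedding

search_body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},  # scored against every doc: brute force
            "script": {
                # cosineSimilarity ranges over -1..1; +1.0 keeps scores non-negative
                "source": "cosineSimilarity(params.query_vector, 'embedding_raw') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    }
}

print(list(search_body["query"]))
```

No index structure is consulted here: every matching document's vector is read and scored, which is why this cannot benefit from any spatial index.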

Is there any 'clever' spatial indexing going on in the background for search? How do I use it with scoring? Or maybe, unlike what the documentation says, documents are just stored as plain text and dense_vector is just a waste of space?
According to this:

There is work being done on this topic as we speak, and if dense vectors are indeed building some sort of spatial index, that would explain the increasing space needed for the documents; yet the documentation still only provides examples of brute-force search using scripts.

Any insights about that would be greatly appreciated.

Anyone had some experience with dense vectors and their behaviour?

@Jacek_Wolkowicz Sorry for a late reply, I hope it could still be useful.

We do store a dense_vector as a binary doc value with size 4*dims + 4, but this is the size before compression. When we store doc values on disk, we compress them, and the compressed size depends on the actual data (repeated data can be compressed much better).
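The compression effect is easy to reproduce outside Elasticsearch. A small sketch (with zlib standing in for whatever codec Lucene actually applies, which is an assumption) shows why rounded or constant vectors take much less space than full-precision ones:

```python
import random
import struct
import zlib

dims = 512
random.seed(0)

def encode(vec):
    # Pack as little-endian float32 values, mimicking (as an assumption)
    # the uncompressed binary doc-value layout described in the docs.
    return struct.pack(f"<{len(vec)}f", *vec)

full = [random.uniform(-1.0, 1.0) for _ in range(dims)]
rounded = [round(x, 1) for x in full]   # only ~21 distinct byte patterns
zeros = [0.0] * dims                    # maximally repetitive

for name, vec in [("zeros", zeros), ("rounded", rounded), ("full", full)]:
    raw = encode(vec)
    print(name, len(raw), "->", len(zlib.compress(raw)))
```

The raw size is identical (2048 bytes) in all three cases; only the compressed size varies, mirroring the index sizes reported in the question.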

We've removed this statement about size from our documentation in later versions so as not to confuse users.