While testing the dense_vector type for indexing deep neural network embeddings, I ran into a very confusing result.
I have been testing the impact and performance of the following part of the mapping:
{ ...
  "embedding_raw": {"type": "dense_vector", "dims": 512}
}
According to the documentation:

Internally, each document's dense vector is encoded as a binary doc value. Its size in bytes is equal to 4 * dims + 4, where dims is the number of the vector's dimensions.
So internally each vector should be stored as a packed list of float32 values, which for 512 dimensions works out to roughly 2 kB per vector.
Then I indexed 10k sample documents. The vector values were real data (not randomly generated), yet random enough to the machine: zero-centered, mostly -1 < x < 1.
At that rate you would expect the index to take about 20 MB; however, it took more than 100 MB.
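The "about 20 MB" expectation follows directly from the documented formula; a quick sanity check:

```python
# Expected on-disk footprint per the documented formula: 4 * dims + 4 bytes per vector.
dims = 512
n_docs = 10_000

bytes_per_vector = 4 * dims + 4              # 2052 bytes, i.e. ~2 kB
expected_total_mb = bytes_per_vector * n_docs / 1024 / 1024

print(bytes_per_vector)    # 2052
print(expected_total_mb)   # ~19.6 MB, i.e. "about 20 MB"
```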
So I started inserting various values for the vectors (all vectors of 512 elements) and noting the index size after ingestion:

[] (nothing, control; dense_vector field removed, only other metadata): 1.5 MB
[0 for _ in range(512)]: 5.6 MB
[0.5 for _ in range(512)]: 5.6 MB

Then I thought maybe the vectors are stored as plain text rather than binary data:

[e/pi for _ in range(512)]: 5.7 MB

Doesn't seem like it, so I started feeding actual vector data:

[round(x, 1) for x in data]: 24.6 MB
[round(x, 2) for x in data]: 31.6 MB
[round(x, 4) for x in data]: 42 MB
[float(round(4*x))/4.0 for x in data]: 6 MB (mostly zeroes)
[sign(x) for x in data]: 18 MB
[x for x in data]: 102 MB
What confused me is that documents indexed this way take significantly more space the more precision you provide. If the vectors really are stored as binary doc values, the extra space would have to be some spatial index to speed up nearest-neighbour search; yet according to the documentation, the only use for dense vectors is to rescore documents with script_score, which is brute force by definition and cannot use an index.
Is there any 'clever' spatial indexing going on in the background for search? If so, how can I use it for scoring? Or, contrary to the documentation, are the documents just stored as plain text, making dense_vector a waste of space?
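For context, this is the brute-force pattern the documentation describes, here sketched as a Python dict for the query body (assuming Elasticsearch 7.x with the mapping above; `query_vec` is a placeholder for a real 512-dim query embedding):

```python
# Brute-force kNN via script_score: every matching document is scored,
# no spatial index is consulted.
query_vec = [0.0] * 512  # placeholder query embedding

body = {
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                # cosineSimilarity returns [-1, 1]; +1.0 keeps scores non-negative
                "source": "cosineSimilarity(params.query_vector, 'embedding_raw') + 1.0",
                "params": {"query_vector": query_vec},
            },
        }
    }
}
```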
According to this:
There is work being done on this topic as we speak, and if dense vectors are indeed building some sort of spatial index, that would explain the increasing space needed for the documents; yet the documentation still only provides examples of brute-force search using scripts.
Any insights about that would be greatly appreciated.