Hi, I have defined one field named "file_section_embedding" in my index mapping as a dense_vector and enabled indexing for it.
I also tested both including the dense vector field in "_source" and excluding it, and both options give me the same storage size for this field, about 49.6gb. I have 10376000 documents, and therefore 10376000 vectors, so this roughly matches the following calculation with the default int8_hnsw quantization (raw float32 vectors plus an int8 copy at a quarter of that size, about 49.5 GB in total).
> float((10376000*1024*4)/(1024**3)) ~= 39.59 GB
> 39.59*0.25 ~= 9.9 GB
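
The same estimate as a small script, in case someone wants to check my arithmetic. This is only a sketch: the 1024 dimensions, 4 bytes per float32 value and 1 byte per int8 value are my assumptions, and the HNSW graph itself is ignored.

```python
# Rough storage estimate for the dense_vector field.
# Assumptions (mine): 1024 dimensions, float32 raw values, and one extra byte
# per dimension for the int8 quantized copy; HNSW graph overhead is ignored.
NUM_VECTORS = 10_376_000
DIMS = 1024
GIB = 1024 ** 3

raw_gib = NUM_VECTORS * DIMS * 4 / GIB   # float32: 4 bytes per dimension
int8_gib = NUM_VECTORS * DIMS * 1 / GIB  # int8: 1 byte per dimension
total_gib = raw_gib + int8_gib

print(f"raw float32: {raw_gib:.2f} GiB")
print(f"int8 copy:   {int8_gib:.2f} GiB")
print(f"total:       {total_gib:.2f} GiB")
```

The total comes out close to the 49.6gb reported for the field, which is why I think both the raw and the quantized vectors are stored there.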
But I have the following questions about the above observations:
1. I saw that after excluding the "file_section_embedding" field from _source, there is an obvious drop in the storage size of the _source field, from 163.5gb to 5.7gb, which I think is due to no longer keeping the raw vectors in _source, right? But what structure is used to keep the dense vectors in _source that consumes so much disk space?
2. It seems strange to me that the storage size of the "file_section_embedding" field is the same whether it is included in or excluded from _source, which gives me the impression that the raw vector values are still kept in the field itself. I tested this with the following query against the index that excludes the field from _source:
POST https://127.0.0.1:9200/file_flat_1024_exclude_vec_3/_search
{
  "size": 5,
  "_source": false,
  "script_fields": {
    "raw_vector": {
      "script": {
        "source": "doc['file_section_embedding'].vectorValue"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}
I also read in the official Elasticsearch documentation (Dense vector field type | Elasticsearch Guide [8.17] | Elastic) that the raw vector values are kept:
> Quantization will continue to keep the raw float vector values on disk for reranking, reindexing, and quantization improvements over the lifetime of the data. This means disk usage will increase by ~25% for int8, ~12.5% for int4, and ~3.1% for bbq due to the overhead of storing the quantized and raw vectors.
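
If I read those percentages correctly, they are simply the size of the quantized copy relative to the 32-bit raw value. A quick sketch of that reading follows; the bits-per-dimension figures are my assumption about each format, not something I took from the Elasticsearch source.

```python
# Bits per dimension for each quantized copy. These are my assumptions about
# the formats, not values taken from the Elasticsearch source code.
RAW_BITS = 32  # float32
QUANT_BITS = {"int8": 8, "int4": 4, "bbq": 1}

# Extra disk usage of the quantized copy, as a percentage of the raw vectors.
overhead_pct = {name: 100 * bits / RAW_BITS for name, bits in QUANT_BITS.items()}

for name, pct in overhead_pct.items():
    print(f"{name}: +{pct:.1f}% on top of the raw vectors")
```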
2.1 Does reranking happen automatically when we index new documents?
2.2 What do the quantization improvements include?
2.3 Does it mean that we could still reindex even when the dense vector field is excluded from _source?
2.4 Does it mean that we could still use the raw vector values for rescoring?
2.5 Could we export all the raw vectors from an index that contains a large amount of vector data, such as 40GB or even more in our case?
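
To put a number on question 1, here is the drop in _source size converted into bytes per vector component. It is a rough sketch that assumes the reported "gb" values are GiB, that the whole drop is due to the vectors, and that each vector has 1024 components.

```python
# How many bytes does each vector component cost inside _source?
# Assumptions (mine): sizes are in GiB, the whole drop comes from the vectors,
# and each vector has 1024 components.
NUM_VECTORS = 10_376_000
DIMS = 1024
GIB = 1024 ** 3

source_with_vectors_gib = 163.5
source_without_vectors_gib = 5.7

vector_bytes = (source_with_vectors_gib - source_without_vectors_gib) * GIB
bytes_per_component = vector_bytes / (NUM_VECTORS * DIMS)

print(f"{bytes_per_component:.1f} bytes per component in _source")
print(f"{bytes_per_component / 4:.1f}x the size of a 4-byte float32")
```

That is roughly 16 bytes per component, about four times a binary float32. My guess is that this is consistent with the values being kept as decimal text in the original JSON document rather than as binary floats, but I would like to have that confirmed.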
Thanks a lot.