How can I exclude dense_vector field from being stored in the _source?
I ran an experiment indexing approximately 6,000,000 documents, and here is what I found after running:
curl --location --request POST 'http://127.0.0.1:9200/index_name.vec/_disk_usage?run_expensive_tasks=true'
The dense_vector field indeed takes approximately 4 KB per document in the index: (4 × 1024 + 4) / 1024 / 1024 / 1024 × (5872381 + 777907) ≈ 25.4 GB,
which is confirmed by: "knn_vectors": "26.2gb"
But the dense vectors are also stored as raw floats, without any optimization, in the _source field. If I save 1024 floats to a plain text file, the file is approximately 21 KB, so: 21 / 1024 / 1024 × (5872381 + 777907) ≈ 133 GB,
which is roughly confirmed by "stored_fields": "107.9gb",
and I can also see the vectors in the search output.
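The two estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the document counts, the "+ 4" term, and the 21 KB text-file size are taken from the post, while the variable names are illustrative.

```python
# Back-of-the-envelope size estimate; all inputs come from the post above.
DIMS = 1024                      # vector dimensions
DOC_COUNT = 5_872_381 + 777_907  # the two document counts used in the post
GIB = 1024 ** 3

# Indexed vector: 4 bytes per float component plus the 4 extra bytes
# per vector that appear as "+ 4" in the post's formula.
knn_bytes_per_doc = 4 * DIMS + 4
knn_total_gib = knn_bytes_per_doc * DOC_COUNT / GIB

# _source copy: the post measured about 21 KB for 1024 floats saved as text.
source_bytes_per_doc = 21 * 1024
source_total_gib = source_bytes_per_doc * DOC_COUNT / GIB

print(f"knn_vectors estimate: {knn_total_gib:.1f} GiB")
print(f"_source estimate:     {source_total_gib:.1f} GiB")
```

By this estimate the text representation in _source is roughly five times larger than the indexed vectors, which matches the order of magnitude reported by _disk_usage.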
So how do I exclude dense_vector from _source, since I do not need this field in the document representation anyway?
I tried setting:
I just don't understand what is going on with Elastic!
I created two indexes with identical mappings except for one parameter: one mapping excludes the dense_vector field from _source and the other does not:
The index with vectors in the source stores the dense_vector as plain floats in _source, as expected.
The index without vectors in the source does not store the dense vectors in _source, BUT it creates a new field called _recovery_source whose size equals what 1000 1024-dim vectors stored as plain floats would occupy.
So even though I explicitly excluded the dense vectors from being stored in Elastic, they are still stored, just in a new field!
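For reference, a field is excluded from _source with the excludes list under _source in the mapping. The mappings used in this experiment are not shown in the thread, so the following is only a sketch of what such a pair of index bodies might look like: the field name embedding and the helper build_index_body are hypothetical, and the 1024 dimensions are taken from the sizes discussed above.

```python
import json


def build_index_body(exclude_vector: bool) -> dict:
    """Return an index-creation body; 'embedding' is a hypothetical field name."""
    mappings = {
        "properties": {
            "embedding": {"type": "dense_vector", "dims": 1024},
        }
    }
    if exclude_vector:
        # Keep the vector searchable but drop it from the stored _source.
        mappings["_source"] = {"excludes": ["embedding"]}
    return {"mappings": mappings}


# The two bodies differ only in the _source section.
print(json.dumps(build_index_body(exclude_vector=False), indent=2))
print(json.dumps(build_index_body(exclude_vector=True), indent=2))
```

Each body would be sent as the JSON payload of a PUT request to the respective index name.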
@ruslaniv Could you try POST no_vector_in_source/_forcemerge? That may remove it.
Another option is to try synthetic source (see "_source field" in the Elasticsearch Guide [8.5]), but that is currently an "all or nothing" deal. I am not sure whether it runs into the _recovery_source issue or not.
Unfortunately, synthetic source is not an option, since we make extensive use of the flattened field type, which is not supported by synthetic source.
I did try running _forcemerge on that index, but it did not resolve the problem: the _recovery_source field is still present in the analysis results, and the index still takes up the same amount of space.
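The check described above (force-merge, then re-run the disk usage analysis) can be sketched in Python as follows. The host and port come from the curl command earlier in the thread and the index name from the reply above; the helper build_post is hypothetical, and the requests are only constructed here, not sent.

```python
import urllib.request

BASE_URL = "http://127.0.0.1:9200"  # host and port from the curl command in the thread
INDEX = "no_vector_in_source"       # index name mentioned in the reply above


def build_post(path: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an endpoint of the index."""
    return urllib.request.Request(f"{BASE_URL}/{INDEX}/{path}", method="POST")


force_merge = build_post("_forcemerge")
disk_usage = build_post("_disk_usage?run_expensive_tasks=true")

for req in (force_merge, disk_usage):
    print(req.get_method(), req.full_url)

# To run against a live cluster, pass each request to urllib.request.urlopen,
# sending force_merge first and then disk_usage to compare the field sizes.
```

Comparing the _recovery_source entry in the two _disk_usage responses, before and after the merge, shows whether the merge reclaimed any space.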