I have an index of over 10 million documents. On those documents, I want to store an OpenAI (or similar) embedding vector in an indexed dense_vector field. I will be using cosine similarity to search through those vectors, combined with filters on other fields. However, the vectors take up quite a lot of space: storing an embedding vector for each of the more than 10 million documents will at least triple my index size. Is there a way to set a value for only some of the documents?
I know that there is the "null_value" parameter, which I've seen used for keyword fields. I don't know if it's supported for dense_vector fields, though. If it is, what would be a good null value for a dense_vector field that doesn't affect the search results? Documents with the null_value shouldn't match the cosine similarity search.
My ultimate goal is to be able to run a normal keyword search on all documents, as I do today without vector search, but in some cases run a cosine similarity search combined with filters on the other fields. That search should only return documents that have an actual vector value in the dense_vector field. Only documents added in the last 3 months will get a vector value; older documents will have their vectors removed.
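To make the setup concrete, here is a rough Python sketch of what I have in mind. The field names (`title`, `created_at`, `embedding`), the 1536 dimensions, the 90-day cutoff and the `build_doc` helper are all placeholders of mine for illustration, not an existing setup. The idea is simply to leave the vector field out of the document source for anything older than the cutoff.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping: field names and dims are assumptions for illustration.
MAPPING = {
    "properties": {
        "title": {"type": "text"},
        "created_at": {"type": "date"},
        "embedding": {
            "type": "dense_vector",
            "dims": 1536,
            "index": True,
            "similarity": "cosine",
        },
    }
}

# Assumed retention window: "the last 3 months" approximated as 90 days.
VECTOR_RETENTION = timedelta(days=90)


def build_doc(title, created_at, embedding, now):
    """Build a document source, keeping the vector only for recent documents."""
    doc = {"title": title, "created_at": created_at.isoformat()}
    if embedding is not None and now - created_at <= VECTOR_RETENTION:
        doc["embedding"] = embedding
    return doc


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
recent = build_doc("new doc", now - timedelta(days=10), [0.1] * 1536, now)
old = build_doc("old doc", now - timedelta(days=400), [0.1] * 1536, now)

print("embedding" in recent, "embedding" in old)
```

The old document simply has no `embedding` key, so whether this saves space depends on the field being truly optional per document.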
Oh cool, there is an explicit null value supported now.
So if I understand correctly, I can set the value of the dense_vector field to null for some documents, which means those documents will never be included in a kNN search, while the documents with an actual vector value in the dense_vector field will still be considered?
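To check that I'm picturing this correctly, here is roughly the search body I would send, written as a Python dict. As far as I understand, the top-level `knn` option takes `field`, `query_vector`, `k`, `num_candidates` and an optional `filter`. The `embedding` and `category` field names and the `knn_search_body` helper are made up for this example.

```python
def knn_search_body(query_vector, k=10, num_candidates=100, filters=None):
    """Build a kNN search body; 'embedding' is an assumed field name."""
    knn = {
        "field": "embedding",
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if filters:
        # Filters restrict which documents the kNN search may return.
        knn["filter"] = {"bool": {"filter": filters}}
    return {"knn": knn}


# Toy 3-dimensional vector; a real one must match the mapped dims.
body = knn_search_body(
    [0.1, 0.2, 0.3],
    k=5,
    num_candidates=50,
    filters=[{"term": {"category": "news"}}],
)
print(body)
```

If my understanding is right, documents without a vector never show up in these results, so no extra `exists` filter on the vector field should be needed.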
How does it interact with hybrid search? Let's say some documents that do not have a vector value in the dense_vector field are scored by the regular keyword search. How does the hybrid scoring deal with that?
Hybrid search in Elasticsearch is currently only an "or" combination. Having documents that don't overlap between the two kinds of search (BM25 and nearest vectors) is common and perfectly fine. A document that is scored by BM25 but not by kNN will only have its BM25 score.
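As a toy illustration of that "or" combination, the Python sketch below sums the scores a document receives from each side, with optional boosts. It is a simplified model for intuition, not Elasticsearch's actual code, and the document ids, scores and the `combine_scores` helper are invented for the example.

```python
def combine_scores(bm25_hits, knn_hits, query_boost=1.0, knn_boost=1.0):
    """Sum boosted scores per document; a missing side contributes nothing."""
    combined = {}
    for doc_id, score in bm25_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + query_boost * score
    for doc_id, score in knn_hits.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + knn_boost * score
    return combined


# "a" has no vector, "c" does not match the keyword query, "b" matches both.
bm25_hits = {"a": 4.0, "b": 2.5}
knn_hits = {"b": 0.75, "c": 0.5}

scores = combine_scores(bm25_hits, knn_hits)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'a': 4.0, 'b': 3.25, 'c': 0.5}
print(ranking)  # ['a', 'b', 'c']
```

Document "a" keeps its BM25 score unchanged even though it has no vector, which is the case asked about above.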