Populate a dense vector field for only a subset of the documents

I have an index of over 10 million documents. On those documents, I want to store an OpenAI (or similar) embedding vector in an indexed dense_vector field. I will use cosine similarity to search those vectors, combined with filters on other fields. However, the vectors take up quite a lot of space: storing an embedding vector for each of the more than 10 million documents will at least triple my index size. Is there a way to set a value for only some of the documents?

I know that there is the "null_value" parameter, which I've seen used for keyword fields. I don't know if it's supported for dense_vector fields, though. If it is, what would be a good null value for a dense_vector field that doesn't affect the search results? Documents with the null_value shouldn't match the cosine similarity search.

My ultimate goal is to be able to do a normal keyword search on all documents, as I do now without vector search, but in some cases do a cosine similarity search combined with filters on the other fields, which should only return documents that have an actual vector value for the dense_vector field. Only documents added in the last 3 months will get a vector value; older documents will have their vectors removed.

Heya @sbruinsje,

You can do the following to not index a vector for a document:

  • Omit the field from the document entirely
  • Set the field value to null explicitly (supported as of 8.7)

In both of these cases, the vector isn't searchable and won't be considered for either brute-force or approximate nearest neighbor search.
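To make the two options above concrete, here is a minimal sketch that builds a mapping and a bulk request body in plain Python, without sending anything to a cluster. The index name `docs`, the field names `title`, `created`, and `embedding`, and the 3-dimensional vectors are all made up for illustration; a real OpenAI embedding would have many more dimensions.

```python
import json

# Hypothetical names and sizes, chosen only for this illustration.
INDEX = "docs"
DIMS = 3

# Mapping with an indexed dense_vector field that uses cosine similarity.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "created": {"type": "date"},
            "embedding": {
                "type": "dense_vector",
                "dims": DIMS,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

documents = [
    # Recent document: carries a real vector.
    {"_id": "1", "title": "recent doc", "created": "2023-06-01",
     "embedding": [0.1, 0.2, 0.3]},
    # Option 1: the vector field is left out entirely.
    {"_id": "2", "title": "old doc", "created": "2022-01-15"},
    # Option 2: the vector field is set to an explicit null
    # (per the reply above, supported as of 8.7).
    {"_id": "3", "title": "older doc", "created": "2021-09-30",
     "embedding": None},
]


def to_bulk_ndjson(index, docs):
    """Serialize documents into the newline-delimited body used by the _bulk API."""
    lines = []
    for doc in docs:
        source = {key: value for key, value in doc.items() if key != "_id"}
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["_id"]}}))
        lines.append(json.dumps(source))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_ndjson(INDEX, documents)
print(json.dumps(mapping, indent=2))
print(bulk_body)
```

Documents 2 and 3 would still be fully searchable by keyword on `title`; they just carry no vector for the kNN side to consider.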

Does this answer your question?

Thanks @BenTrent!

Oh cool, there is an explicit null value supported now :)

So if I understand correctly, I can set the value of the dense_vector field to null for some documents causing those documents to never be included in a knn search, yet the documents with an actual vector value for the dense_vector field will be considered for the knn search?

How does it interact with hybrid search? Let's say some documents that do not have a vector value for the dense_vector field are scored by the regular keyword search. How does the hybrid scoring deal with that?

Hybrid search in Elasticsearch is currently only an "or" combination. Having documents that don't overlap between the two search kinds (BM25 & nearest vectors) is common and is perfectly fine. A document that is scored via BM25, but not in kNN, will only have its BM25 score.
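As a rough illustration of that "or" combination, the sketch below builds a search body with a `query` and a `knn` section side by side, then models the scoring with a toy function. The field names (`title`, `embedding`, `created`) and the `now-3M` filter are assumptions for this example, and `combine_scores` is a simplified stand-in that sums the two scores without boosts, not Elasticsearch's actual implementation.

```python
import json


def hybrid_search_body(text, query_vector, k=10, num_candidates=100):
    """Build a search body that runs a BM25 query and a kNN search together."""
    return {
        # Keyword side: matches any document, with or without a vector.
        "query": {"match": {"title": text}},
        # Vector side: only documents that have a vector can be returned here.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # Optional filter applied to the kNN side (hypothetical date field).
            "filter": {"range": {"created": {"gte": "now-3M"}}},
        },
        "size": k,
    }


def combine_scores(bm25_scores, knn_scores):
    """Toy model: union of both hit sets, scores summed where a doc is in both."""
    combined = {}
    for doc_id in set(bm25_scores) | set(knn_scores):
        combined[doc_id] = bm25_scores.get(doc_id, 0.0) + knn_scores.get(doc_id, 0.0)
    # Highest combined score first.
    return dict(sorted(combined.items(), key=lambda item: item[1], reverse=True))


body = hybrid_search_body("vector search", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))

# Document "2" has no vector, so it only appears on the BM25 side.
bm25 = {"1": 2.0, "2": 1.5}
knn = {"1": 0.5}
combined = combine_scores(bm25, knn)
print(combined)
```

In this toy run, document "2" ends up with only its BM25 score, which matches the behavior described above.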

Perfect, that answers my questions. Thanks!
