Populate a dense vector field for only a subset of the documents

sbruinsje · August 8, 2023, 6:36am

I have an index of over 10 million documents. On those documents, I want to store an OpenAI (or similar) embedding vector using an indexed dense vector field. I will be using cosine similarity to search through those vectors, combined with filters on other fields. However, the vectors take up quite a lot of space: storing an embedding vector for each of the more than 10 million documents will at least triple my index size. Is there a way to set a value for only part of the documents?

I know that there is the "null_value" parameter, which I've seen used for keyword fields. I don't know if it's supported for dense_vector fields, though. If it is, what would be a good null value for a dense_vector field that doesn't affect the search results? Documents with the null_value shouldn't match the cosine similarity search.

My ultimate goal is to be able to do a normal keyword search on all documents, as I do now without vector search, but in some cases do a cosine similarity search combined with filters on the other fields, which should only return documents that have an actual vector value for the dense_vector field. Only documents added in the last 3 months will get a vector value; old documents will have their vectors removed.

BenTrent · August 8, 2023, 2:09pm

Heya @sbruinsje,

You can do either of the following to avoid indexing a vector for a document:

- Leave the field out of the document entirely
- Set the field value to null explicitly (supported as of 8.7)

In both of these cases, the document has no searchable vector and won't be considered for either brute-force or approximate nearest neighbor search.
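A minimal sketch of the two options above, written as plain Python that only builds the request bodies (no client library is used). The index name "articles", the field names "title", "published" and "embedding", and the 3-dimensional vectors are all made up for illustration; a real embedding would have many more dimensions.

```python
import json

# Hypothetical index and field names, chosen only for this example.
INDEX = "articles"

# Mapping with an indexed dense_vector field using cosine similarity.
# dims=3 keeps the example short; use your embedding model's dimension.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "published": {"type": "date"},
            "embedding": {
                "type": "dense_vector",
                "dims": 3,
                "index": True,
                "similarity": "cosine",
            },
        }
    }
}

docs = [
    # Recent document: gets a real vector.
    {"_id": "1", "title": "recent doc", "published": "2023-08-01",
     "embedding": [0.1, 0.2, 0.3]},
    # Option 1: the field is simply left out.
    {"_id": "2", "title": "old doc, field omitted", "published": "2022-01-15"},
    # Option 2: the field is set to an explicit null (None becomes JSON null).
    {"_id": "3", "title": "old doc, explicit null", "published": "2022-02-20",
     "embedding": None},
]


def to_bulk_ndjson(index, docs):
    """Build a _bulk request body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        source = {k: v for k, v in doc.items() if k != "_id"}
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["_id"]}}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


bulk_body = to_bulk_ndjson(INDEX, docs)
print(json.dumps(mapping, indent=2))
print(bulk_body)
```

The mapping would be sent when creating the index and the NDJSON body to the _bulk endpoint. Only document 1 carries a vector, so only it would be a candidate for kNN search.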

Does this answer your question?

sbruinsje · August 9, 2023, 8:05am

Thanks @BenTrent!

Oh cool, an explicit null value is supported now.

So if I understand correctly, I can set the value of the dense_vector field to null for some documents, causing those documents to never be included in a kNN search, while the documents with an actual vector value for the dense_vector field will still be considered for the kNN search?

How does it interact with hybrid search? Let's say some documents that are scored by the regular keyword search do not have a vector value for the dense_vector field. How does the hybrid scoring deal with that?

BenTrent · August 9, 2023, 11:45am

Hybrid search in Elasticsearch is currently only an "or" combination. Having documents that don't overlap between the two search kinds (BM25 & nearest vectors) is common and is perfectly fine. A document that is scored via BM25, but not in kNN, will only have its BM25 score.
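As a rough illustration of that behaviour, here is a sketch in plain Python. It builds a search request body that combines a keyword query with a top-level knn clause and a filter, then runs a toy model of the "or" combination. The field names "title", "published" and "embedding" and the 3-month filter are assumptions for the example, and combine_scores is a hypothetical helper that assumes the two scores are simply summed for documents matched by both; it is not Elasticsearch's actual implementation.

```python
import json


def hybrid_search_body(text, query_vector, k=10, num_candidates=100):
    """Build a search body with a BM25 query plus a filtered kNN clause.

    Field names are hypothetical; adjust them to your own mapping.
    """
    return {
        # Keyword (BM25) part: can match any document, with or without a vector.
        "query": {"match": {"title": text}},
        # Vector part: only documents that have an indexed vector can match.
        "knn": {
            "field": "embedding",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # Filter applied to the kNN candidates (here: last 3 months).
            "filter": {"range": {"published": {"gte": "now-3M"}}},
        },
        "size": k,
    }


def combine_scores(bm25_scores, knn_scores):
    """Toy model of the "or" combination, assuming scores are summed.

    A document missing from one side contributes 0 for that side, so a
    document without a vector keeps only its BM25 score.
    """
    combined = {}
    for doc_id in set(bm25_scores) | set(knn_scores):
        combined[doc_id] = bm25_scores.get(doc_id, 0.0) + knn_scores.get(doc_id, 0.0)
    return combined


body = hybrid_search_body("vector search", [0.1, 0.2, 0.3], k=5)
print(json.dumps(body, indent=2))

# Doc "1" has a vector and matches both sides; doc "2" has no vector.
combined = combine_scores({"1": 2.0, "2": 1.5}, {"1": 0.9})
print(combined)
```

In this toy model, document "2" ends up with exactly its BM25 score, which mirrors the description above.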

sbruinsje · August 9, 2023, 12:56pm

Perfect, that answers my questions. Thanks!

system · September 6, 2023, 12:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
