Dense search for large documents

mwon · December 13, 2023, 11:20am

Hi,

I want to use ES to index documents and do semantic search with kNN. This type of search requires encoding every document with an embedding model and indexing each vector, so that an encoded query can later be searched against them.

I already have a working solution, but the embedding model is limited in the number of words it can encode, so my current solution works only for short documents.

This is a known problem and, from what I read here, ES 8.11 already supports indexing multiple vectors per document. The idea is to split the document into smaller chunks and encode one vector per chunk.

My question is about implementation. I cannot replicate what they show in the blog post, because I want to adapt my existing code (in Python), where I run the model locally and do the chunking myself.

What I'm looking for is how to index each chunk vector as part of my document. I can't find any documentation on how to do it via the API.

Any help is appreciated. Thanks

BenTrent · December 13, 2023, 12:42pm

@mwon ,

You can either index each passage as its own document or you can use nested mappings.

PUT chunker
{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "passages": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "index": true,
            "dims": 384,
            "similarity": "dot_product"
          }
        }
      }
    }
  }
}

Then, for your index requests it would look something like:

{"passages": [{"vector": [...]}, {"vector": [...]}]}

Here are some docs: k-nearest neighbor (kNN) search | Elasticsearch Guide [8.11] | Elastic
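
Since the question mentions running the model locally from Python, here is a minimal sketch of how the document body for the `chunker` mapping above could be assembled. Everything in it is an assumption for illustration: the helper names (`split_into_chunks`, `normalize`, `build_document`), the word-window chunking with overlap, the extra `text` fields (which rely on the mapping's `"dynamic": "true"`), and the `embed` callable that stands in for your own model. The vectors are normalized because `dot_product` similarity expects unit-length vectors.

```python
import math
from typing import Callable, Dict, List, Sequence


def split_into_chunks(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length, as dot_product similarity expects."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def build_document(
    text: str,
    embed: Callable[[str], Sequence[float]],  # hypothetical: your local embedding model
    max_words: int = 200,
    overlap: int = 20,
) -> Dict:
    """Build one Elasticsearch document with one nested passage per chunk."""
    passages = [
        {"text": chunk, "vector": normalize(embed(chunk))}
        for chunk in split_into_chunks(text, max_words, overlap)
    ]
    return {"text": text, "passages": passages}
```

With the official Python client (8.x), the resulting dict could then be sent with something like `es.index(index="chunker", id="doc-1", document=build_document(text, embed))`, assuming `es` is an already configured `Elasticsearch` instance and `embed` returns 384 floats to match `dims`.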

joemcelroy · December 13, 2023, 12:54pm

There's also a notebook example for the article here, which may help: https://github.com/elastic/elasticsearch-labs/blob/main/notebooks/document-chunking/with-index-pipelines.ipynb

mwon · December 13, 2023, 1:50pm

Thanks @BenTrent. It's exactly what I was looking for.
Regarding the kNN search itself, do you know if the score calculation takes into account the fact that some vectors originate from the same document? Or is it, in the end, just a normal kNN search over all vectors that returns the top hit from each distinct document?

BenTrent · December 13, 2023, 2:10pm

To quote the docs:

kNN search over nested dense_vectors will always diversify the top results over the top-level document. Meaning, "k" top-level documents will be returned, scored by their nearest passage vector

So, it's scored by the nearest passage, and we keep exploring the graph until we get k total docs, not k total passages.
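
To make that behaviour concrete, here is a small sketch in Python. `nested_knn_body` builds a top-level `knn` search body against the `passages.vector` field from the mapping earlier in the thread. `top_k_documents` is a brute-force illustration of the ranking only: each document is scored by its best passage, and the k best documents are returned. It is not how Elasticsearch executes the search (that explores an approximate nearest-neighbour graph), and the raw dot product used here is not necessarily the `_score` value Elasticsearch reports. Both function names and the default values are assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nested_knn_body(
    query_vector: Sequence[float], k: int = 10, num_candidates: int = 100
) -> Dict:
    """Search body for kNN over the nested passages.vector field."""
    return {
        "knn": {
            "field": "passages.vector",
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        }
    }


def top_k_documents(
    docs: Dict[str, List[Sequence[float]]],
    query_vector: Sequence[float],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank documents by their nearest passage and keep k documents, not k passages."""
    scored = [
        (doc_id, max(dot(vector, query_vector) for vector in vectors))
        for doc_id, vectors in docs.items()
        if vectors  # skip documents without any passage vectors
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

For example, a document with two passages appears once in the result, scored by the closer of the two. With the official Python client (8.x), the body could be passed along the lines of `es.search(index="chunker", knn=nested_knn_body(query_vector)["knn"])`, assuming `es` is a configured `Elasticsearch` instance and `query_vector` was produced by the same embedding model as the passages.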

system · January 10, 2024, 2:11pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
