Dense search for large documents


I want to use ES to index documents and do semantic search with kNN. For this type of search we need to encode every document with an embedding model and index each vector, so that an encoded query can later be matched against them.

I already have a working solution, but the issue is that the embedding model is limited in the number of words it can process, so my current solution only works for short documents.

This is a known problem, and from what I read here, ES 8.11 is already prepared to index multiple vectors per document. The idea is to split the document into smaller chunks and encode one vector per chunk.

My question is about implementation. I can't simply replicate what they show in the blog post, because I want to adapt my existing code (in Python), where I run the model locally and do the chunking myself.

What I'm looking for is how to index each chunk vector for my document. I can't find any documentation on how to do it via the API.

Any help is appreciated. Thanks

@mwon ,

You can either index each passage as its own document or you can use nested mappings.

PUT chunker
{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "passages": {
        "type": "nested",
        "properties": {
          "vector": {
            "type": "dense_vector",
            "index": true,
            "dims": 384,
            "similarity": "dot_product"
          }
        }
      }
    }
  }
}

Then, for your index requests it would look something like:

{"passages": [{"vector": [...]}, {"vector": [...]}]}

Here are some docs: k-nearest neighbor (kNN) search | Elasticsearch Guide [8.11] | Elastic


There's also a notebook example for the article here, which may help :slight_smile:


Thanks @BenTrent. It's exactly what I was looking for.
Regarding the kNN search itself, do you know if the score calculation takes into account the fact that some vectors originate from the same document, or is it in the end just a normal kNN search over all vectors that returns the top vector from each distinct document?

To quote the docs:

kNN search over nested dense_vectors will always diversify the top results over the top-level document. Meaning, "k" top-level documents will be returned, scored by their nearest passage vector

So, it's scored by the nearest passage, and ES keeps exploring the graph until it gets k total documents, not k total passages.
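As a sketch, a nested kNN search over the passage vectors could be built like this with the Python client. The index name `chunker` and the `embed` function are carried over as assumptions from the mapping above; `k` is the number of top-level documents returned.

```python
def knn_query(query_vector, k=10, num_candidates=100):
    """kNN clause over the nested passage vectors.

    ES diversifies results over top-level documents: k documents come back,
    each scored by its nearest passage vector.
    """
    return {
        "field": "passages.vector",
        "query_vector": query_vector,
        "k": k,
        "num_candidates": num_candidates,
    }

# With the official Python client, e.g.:
#   resp = es.search(index="chunker", knn=knn_query(embed("my search query")))
#   hits = resp["hits"]["hits"]  # k top-level docs, scored by nearest passage
```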

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.