Retrieving top N hits from nested documents across all matching documents

Hello,

I'm currently working with an Elasticsearch index where each document contains a nested field, `embedingContent`, representing "chunks" of the document. Each chunk has its own vector embedding, and I want to perform a vector similarity search across these chunks.

Here's a sample of the mapping for the `embedingContent` field:

```
"embedingContent" : {
    "type" : "nested",
    "properties" : {
        "content" : {
            "type" : "text",
            "fields" : {
                "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                }
            }
        },
        "contentTokens" : {
            "type" : "long"
        },
        "embeddedString" : {
            "type" : "text",
            "fields" : {
                "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                }
            }
        },
        "embedding" : {
            "type" : "dense_vector",
            "dims" : 1536
        },
        "id" : {
            "type" : "long"
        },
        "newsArticleId" : {
            "type" : "long"
        },
        "parentId" : {
            "type" : "long"
        },
        "splitId" : {
            "type" : "long"
        }
    }
}
```

I want to run a query that retrieves the top 20 most relevant chunks across all documents, based on the cosine similarity of their embeddings to a query vector. However, I'm finding it difficult to do this because the `size` parameter in `inner_hits` only limits the number of chunks per document, not the total number of chunks across all documents. With the default top-level `size` of 10, for example, the response can contain up to 10 × 20 = 200 chunks.

Here's the Elasticsearch query I'm currently using:

```
{
    "query": {
        "nested": {
            "path": "embedingContent",
            "query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        "source": "cosineSimilarity(params.query_vector, 'embedingContent.embedding') + 1.0",
                        "params": {"query_vector": [0.1, 0.2, 0.3, ...]}  // Example query vector
                    }
                }
            },
            "inner_hits": {
                "size": 20
            }
        }
    },
    "_source": false
}
```

Does anyone know of a way to limit the total number of inner hits (chunks) returned across all documents? Any help would be greatly appreciated.
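
One possible client-side workaround, as a minimal sketch rather than a confirmed solution: flatten the inner hits of every returned document, sort them by `_score`, and keep the best 20. The function name `top_chunks` and its parameters are hypothetical. The sketch assumes the standard search response shape and that the inner hits are keyed by the nested path (`embedingContent`), which is the default when no `name` is set in `inner_hits`. It also assumes the query is changed to use `"score_mode": "max"` on the `nested` clause with a top-level `size` of at least 20, so that every document holding one of the 20 best chunks is actually returned; neither setting is in the query above.

```python
from __future__ import annotations

from typing import Any


def top_chunks(
    response: dict[str, Any],
    inner_name: str = "embedingContent",  # assumed: default inner_hits name is the nested path
    limit: int = 20,
) -> list[dict[str, Any]]:
    """Merge the inner hits of all parent documents and keep the `limit` best."""
    chunks = []
    for parent in response["hits"]["hits"]:
        inner = parent.get("inner_hits", {}).get(inner_name, {})
        for hit in inner.get("hits", {}).get("hits", []):
            chunks.append(
                {
                    "parent_id": parent["_id"],
                    # Position of the chunk inside the parent's nested array.
                    "offset": hit.get("_nested", {}).get("offset"),
                    "score": hit["_score"],
                }
            )
    # Highest similarity first, then cut to the global limit.
    chunks.sort(key=lambda chunk: chunk["score"], reverse=True)
    return chunks[:limit]
```

Calling `top_chunks(response)` on the parsed JSON body of the search response returns at most 20 entries, each identifying a chunk by its parent document `_id` and nested offset. This still transfers up to 20 chunks per document over the wire, so it trims the result rather than the work done by Elasticsearch.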
