Hello,
I'm currently working with an Elasticsearch index where each document contains a nested field, `embedingContent`, whose elements represent "chunks" of the document. Each chunk has its own vector embedding, and I want to perform a vector similarity search across these chunks.
Here's a sample of the mapping for the `embedingContent` field:
"embedingContent" : {
"type" : "nested",
"properties" : {
"content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"contentTokens" : {
"type" : "long"
},
"embeddedString" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"embedding" : {
"type" : "dense_vector",
"dims" : 1536
},
"id" : {
"type" : "long"
},
"newsArticleId" : {
"type" : "long"
},
"parentId" : {
"type" : "long"
},
"splitId" : {
"type" : "long"
}
}
}```
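For illustration, a document in this index looks roughly like the following. The field values here are made up, the embedding vectors are truncated (the real ones have 1536 dimensions), and I'm showing `splitId` as the chunk index within the article:
```
{
  "embedingContent": [
    {
      "id": 101,
      "newsArticleId": 42,
      "parentId": 0,
      "splitId": 0,
      "content": "First chunk of the article text ...",
      "contentTokens": 180,
      "embeddedString": "First chunk of the article text ...",
      "embedding": [0.011, -0.024, 0.093, ...]
    },
    {
      "id": 102,
      "newsArticleId": 42,
      "parentId": 0,
      "splitId": 1,
      "content": "Second chunk of the article text ...",
      "contentTokens": 175,
      "embeddedString": "Second chunk of the article text ...",
      "embedding": [0.007, 0.051, -0.012, ...]
    }
  ]
}
```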
I want to run a query that retrieves the top 20 most relevant chunks across all documents, based on the cosine similarity of their embeddings to a query vector. However, I'm finding it difficult to do this because the `size` parameter in `inner_hits` only limits the number of chunks per document, not the total number of chunks across all documents.
Here's the Elasticsearch query I'm currently using:
```
{
  "query": {
    "nested": {
      "path": "embedingContent",
      "query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'embedingContent.embedding') + 1.0",
            "params": { "query_vector": [0.1, 0.2, 0.3, ...] } // example query vector, truncated
          }
        }
      },
      "inner_hits": {
        "size": 20
      }
    }
  },
  "_source": false
}
```
Does anyone know of a way to limit the total number of inner hits (chunks) returned across all documents? Any help would be greatly appreciated.
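For reference, the closest workaround I've come up with is over-fetching: request the top 20 parent documents (with `"score_mode": "max"` so parents are ranked by their best chunk rather than the default average), pull up to 20 chunks from each via `inner_hits`, then merge all returned inner hits by `_score` client-side and keep the global top 20. This works, but in the worst case it transfers 400 chunks just to keep 20. A sketch of that query, using the same script and a truncated example vector:
```
{
  "size": 20,
  "query": {
    "nested": {
      "path": "embedingContent",
      "score_mode": "max",
      "query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "cosineSimilarity(params.query_vector, 'embedingContent.embedding') + 1.0",
            "params": { "query_vector": [0.1, 0.2, 0.3, ...] } // example query vector, truncated
          }
        }
      },
      "inner_hits": {
        "size": 20
      }
    }
  },
  "_source": false
}
```
The other option I'm weighing is indexing each chunk as its own top-level document; a plain `script_score` query with `"size": 20` would then return exactly the global top 20 chunks, at the cost of duplicating the parent metadata onto every chunk. I'd prefer to avoid that reindexing if there's a way to do this with the nested mapping.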