I have a query which makes use of collapse to de-duplicate results. I'm expecting some collapsed results to have multiple duplicates, and some to be unique. I would like to sort the entries in such a way that the results with the most duplicates come first.
I was expecting something like a score_mode parameter, similar to the one on nested queries, which would cause the documents within each collapsed group to affect the score of that group, but that doesn't exist at the moment.
Simplified Example
Suppose I have two indexes: one containing song metadata, and one containing song lyrics. I want to query both at the same time to find songs based on fields from both kinds of document.
The query might look like this:
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "artist": "johnny"
          }
        },
        {
          "terms": {
            "lyrics": [
              "the",
              "cat"
            ]
          }
        }
      ]
    }
  },
  "collapse": {
    "field": "song"
  }
}
I want results that match both should clauses to sort to the top, but I can't see a way to do this.
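To make the desired ordering concrete, here is a minimal client-side sketch in Python of what I'd like the search to do for me. The hit shape, the index names and the song values are made up for illustration, and `order_by_duplicates` is a hypothetical helper, not an Elasticsearch feature.

```python
from collections import Counter

# Hypothetical uncollapsed hits: one entry per matching document.
# Index names and song values are invented for this example.
hits = [
    {"_index": "song-metadata", "song": "a"},
    {"_index": "song-lyrics", "song": "b"},
    {"_index": "song-metadata", "song": "b"},
    {"_index": "song-lyrics", "song": "c"},
]


def order_by_duplicates(hits, field="song"):
    """Return the collapse keys, most-duplicated first."""
    # Count how many hits share each value of the collapse field.
    counts = Counter(hit[field] for hit in hits)
    # sorted() is stable, so keys with equal counts keep first-seen order.
    return sorted(counts, key=lambda key: -counts[key])


ordered = order_by_duplicates(hits)
```

Here song "b" matched in both indexes, so it should come first. Doing this on the client means fetching every uncollapsed hit, which is what I'm hoping to avoid.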
I think the standard advice would be to merge the documents. Unfortunately, in my actual use case the song-lyrics index is populated by a task that takes a long time to run. In the meantime I want the metadata to be available to query, and updating a single document once the result is in causes major issues with cluster performance.
Does anyone know of a way to accomplish this without merging the documents together?