Is there any way to perform the aggregations after the rescore? If not, how feasible would it be to write a plugin that runs the rescore function over the main query results on each shard involved, aggregates post-rescore on each shard, and then combines those per-shard aggregations on the query node of the cluster? If that's feasible and the only way to achieve the result I want, could anyone point me to an existing plugin that does something like this as a base to start from?
Here I run my complex but very fast boolean query for an initial result set. Then I take the top 10000 on each shard from that result set and apply aggregations to it. I order the term aggregations by a max aggregation for each term, using a complex script for vector similarity scoring. Then within each top term bucket I again use the complex script to rank the members, then pick the top hit from that ranking.
Note that instead of using the script on the top hit, I could just use my original rescore function (also on the top 10000 hits per shard) and then rank top hits by "_score". Unfortunately, the max agg that sorts the buckets can't use that rescore value, since aggregations ignore the rescore. I found that for the top hits at least, it was faster to specify the script directly rather than use the rescore. My solution works, but it has to calculate the expensive script score twice. I haven't been able to find a better way to do this.
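For anyone following along, the approach described above might look roughly like the request body below, built here as a Python dict. This is only a sketch: the aggregation names (`top_sample`, `by_identical_id`, `best_score`, `best_doc`), the bucket count, and the placeholder script are assumptions for illustration, not the actual query from this thread. Only the 10000-per-shard sample and the `identical_id` field come from the discussion.

```python
import json


def build_sampled_distinct_query(base_query, score_script, shard_size=10000, buckets=10):
    """Sample the top docs per shard, bucket them by identical_id, order the
    buckets by their max script score, and keep the best doc in each bucket."""
    # Placeholder for the expensive vector similarity script (assumption).
    script = {"lang": "painless", "source": score_script}
    return {
        "size": 0,
        "query": base_query,
        "aggs": {
            "top_sample": {
                # Limit the child aggs to the best-scoring docs on each shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "by_identical_id": {
                        "terms": {
                            "field": "identical_id",
                            "size": buckets,
                            # Order buckets by the max sub-aggregation below.
                            "order": {"best_score": "desc"},
                        },
                        "aggs": {
                            # First use of the script: one score per bucket.
                            "best_score": {"max": {"script": script}},
                            # Second use of the script: rank docs in the bucket.
                            "best_doc": {
                                "top_hits": {
                                    "size": 1,
                                    "sort": [
                                        {
                                            "_script": {
                                                "type": "number",
                                                "script": script,
                                                "order": "desc",
                                            }
                                        }
                                    ],
                                }
                            },
                        },
                    }
                },
            }
        },
    }


# Hypothetical usage: match_all stands in for the real boolean query.
body = build_sampled_distinct_query({"match_all": {}}, "doc['rank'].value")
request_json = json.dumps(body, indent=2)
```

The script appears in two places (`best_score` and `best_doc`), which is where the double calculation mentioned above comes from.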
Aggregations are designed to work on all matching docs.
Therefore, if your agg includes a custom scoring script that does expensive work, it will run on all matching docs, which somewhat negates the point of using the rescore query in the first place (the idea being to limit expensive scoring algos to the top docs produced by cheap scoring algos).
If you are going to run expensive scripts, you could look at using the sampler or diversified sampler aggregations to limit the set of cheaply-scored docs you run a child scripted aggregation on.
Hi Mark, thanks for the suggestion. In my partial solution above I'm actually using the sampler aggregation to do exactly what you're suggesting. I initially thought a rescore function on the main query would achieve the same result but I realize now that rescore essentially runs after the aggregations and so they will ignore it. You may have noticed in my solution that I still have to run the expensive scoring script twice within the sample, once for ordering the buckets and again to select the top match within each bucket.
Is there any way to avoid this double execution of the expensive script in my solution above? Also, is there a different way entirely to achieve my end goal, i.e. to run a query, select the top N results from that query, run an expensive rescore over them, and then select 'distinct' results from among the rescored hits such that no two results share the same identical_id?
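To pin down the semantics being asked for, here is a plain Python sketch of the end goal, run over an in-memory list of hits rather than inside Elasticsearch. It is not an Elasticsearch feature or a proposed fix; the function name, the `expensive_score` callback, and the hit layout are all hypothetical. It only shows that the desired result needs one expensive score per hit.

```python
def distinct_after_rescore(hits, expensive_score, window=10000, group_key="identical_id"):
    """Take the top `window` hits by cheap score, rescore each one exactly
    once, and keep only the best rescored hit per group key."""
    # Step 1: top N by the cheap query score.
    top = sorted(hits, key=lambda h: h["_score"], reverse=True)[:window]

    # Step 2: expensive score, computed once per hit, tracking the best per group.
    best = {}
    for hit in top:
        score = expensive_score(hit)
        key = hit[group_key]
        if key not in best or score > best[key][0]:
            best[key] = (score, hit)

    # Step 3: distinct hits ordered by their expensive score, best first.
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in ranked]
```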
Doh. Missed that, sorry. You still might want to try the diversified sampler, though; otherwise your top 10000 docs might all be for the same identical_id, leaving no matches for other IDs. Just diversify on the identical_id field.
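As a rough sketch of that suggestion, the helper below wraps a set of child aggregations in a `diversified_sampler` keyed on `identical_id`. The helper name, the `diverse_sample` aggregation name, and the chosen `max_docs_per_value` are assumptions; `shard_size`, `field`, and `max_docs_per_value` are the aggregation's documented parameters.

```python
def build_diversified_sample(inner_aggs, field="identical_id",
                             shard_size=10000, max_docs_per_value=1):
    """Wrap child aggregations in a diversified sampler so that no single
    value of `field` can fill the whole per-shard sample."""
    return {
        "diverse_sample": {
            "diversified_sampler": {
                "shard_size": shard_size,                  # docs sampled per shard
                "field": field,                            # value to diversify on
                "max_docs_per_value": max_docs_per_value,  # cap per distinct value
            },
            "aggs": inner_aggs,
        }
    }


# Hypothetical usage: the child agg is a plain terms agg on the same field.
aggs = build_diversified_sample(
    {"by_identical_id": {"terms": {"field": "identical_id"}}},
    max_docs_per_value=100,
)
```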
I'm not expecting to have that many results with the same identical_id, but that could be a solution should I encounter that. I still need to aggregate over the top scored matches from the main query, though. Does the diversified sampler return items with a probability related to that score, or does it randomly sample over the entire set of matches?
I describe it like the algo for making a 1960s top hits compilation. Without diversity it would just be a Beatles greatest hits compilation. With diversification set on “artist” and value 2, you would get the top 2 Beatles hits and the best of the rest (including one-hit wonders, and also ensuring any other popular artists have max 2 hits).
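The compilation analogy can be simulated in a few lines of Python. This is a toy model of the selection rule as described here (best-scoring first, with a per-artist cap), not Elasticsearch's actual implementation; the function name and the sample data are made up.

```python
def diversified_top(docs, key, size, max_per_value):
    """Pick up to `size` docs in descending score order, allowing at most
    `max_per_value` docs for any single value of `key`."""
    picked = []
    counts = {}
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        value = doc[key]
        if counts.get(value, 0) >= max_per_value:
            continue  # this artist already has its quota of hits
        picked.append(doc)
        counts[value] = counts.get(value, 0) + 1
        if len(picked) == size:
            break
    return picked


# Made-up chart: the top three songs are all by the same artist.
songs = [
    {"title": "B1", "artist": "Beatles", "score": 100},
    {"title": "B2", "artist": "Beatles", "score": 99},
    {"title": "B3", "artist": "Beatles", "score": 98},
    {"title": "S1", "artist": "Stones", "score": 97},
    {"title": "S2", "artist": "Stones", "score": 96},
    {"title": "S3", "artist": "Stones", "score": 95},
    {"title": "K1", "artist": "Kinks", "score": 90},
]
compilation = diversified_top(songs, key="artist", size=5, max_per_value=2)
titles = [s["title"] for s in compilation]
```

With a cap of 2 per artist, the third-best song overall is skipped and a lower-scoring one-hit wonder makes the cut. The selection is driven by the score the docs already have, not by random sampling.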
I think in my case the diversified sampler would give me the top hit based on my base query, but not based on the rescore / agg script in the previous examples. So I would need to return 10,000 or so top hits per diversity category, so I could rescore over a large number to get the actual top hit(s) per identical_id that I want.