Custom TopDocsCollector in plugin?

Hello!

I'm working on a plugin to add some cheminformatics features to our Elasticsearch cluster. Several of these involve an inverted-index-based screening stage (which fits well as a Lucene query), followed by a more expensive algorithm to eliminate false positives.

The nature of these algorithms lets us use the setMinCompetitiveScore API to skip the expensive steps when sorting by score, which keeps the query tractable even for non-selective queries. However, we would also like to be able to sort by other fields, which is a challenge out of the box because of the way Lucene sorting works: all of the hits have to be collected before sorting, otherwise hits could be lost.

One potential solution I'm considering is creating a custom doc collector and moving the check for a false positive there, so that the pseudocode would look something like this:

get next doc id from screening step
check sort value of doc against priority queue; if it won't make it into the top K hits, skip it
else run the expensive false positive check and collect result if it passes
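
To make the idea concrete, here is a minimal sketch of that loop in plain Java. It does not use any Lucene or Elasticsearch API; the class LazyTopK, the method collect, and the two callbacks are hypothetical names used only for illustration. It assumes an ascending sort on a single long value, with ties going to the doc seen first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.IntPredicate;
import java.util.function.IntToLongFunction;

// Hypothetical illustration only: not a Lucene Collector.
public class LazyTopK {

    /**
     * Returns the doc ids of the k verified hits with the smallest sort
     * values, ordered by ascending sort value (ties broken by doc id).
     */
    public static List<Integer> collect(int[] candidates,
                                        IntToLongFunction sortValue,
                                        IntPredicate expensiveCheck,
                                        int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        // Max-heap on sort value: the head is the worst hit currently kept.
        // Each entry is {sortValue, docId}.
        PriorityQueue<long[]> queue =
            new PriorityQueue<>((a, b) -> Long.compare(b[0], a[0]));

        for (int doc : candidates) {
            long value = sortValue.applyAsLong(doc);
            // Cheap test first: a full queue whose worst entry is at least
            // as good means this doc cannot enter the top k, so skip it.
            if (queue.size() == k && value >= queue.peek()[0]) {
                continue;
            }
            // Only competitive docs pay for the false-positive check.
            if (!expensiveCheck.test(doc)) {
                continue;
            }
            if (queue.size() == k) {
                queue.poll(); // evict the current worst hit
            }
            queue.add(new long[] {value, doc});
        }

        // Drain the heap into ascending order for display.
        List<long[]> entries = new ArrayList<>(queue);
        entries.sort((a, b) -> a[0] != b[0]
            ? Long.compare(a[0], b[0])
            : Long.compare(a[1], b[1]));
        for (long[] entry : entries) {
            result.add((int) entry[1]);
        }
        return result;
    }
}
```

The point of the ordering is that the sort-value comparison is cheap and runs for every candidate, while the expensive check only runs for docs that could still displace something in the queue. As more verified hits arrive, the bar rises and fewer candidates reach the expensive step.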

However, I'm not sure if there is a place to inject this doc collector via a plugin - it's not obvious to me if the plugin interface supports that. Any guidance on how I can accomplish this? As a last resort, I was considering creating a custom action plugin that exposes our own search API, but I'd rather not have to keep that up to date with the evolution of changes in the main search API, since we'll be composing this with other search capabilities.

I don't believe you can do it now for the top level search. You can do it for aggs, but that doesn't let you touch the main search collector. If you are just doing an agg on the matches anyway it'd work fine. Aggs like terms delay collection of the sub-aggs until we know which buckets are selected. If that's the kind of thing you are after you can look there.

If not, is there some way you could hack it into a custom Weight? It's quite normal for plugins to make custom queries.

We are already building custom weights and scorers for the queries, but the issue is that AFAIK those don't have the required information about the top K sort values when sorting by something other than score. Am I mistaken there?

Unfortunately this isn't for aggs, we're using it to display traditional search results for things like chemical substructure matches.
