I would like to accomplish the following with a query:
1.) Perform a complex query that calculates a score for each matched document
2.) Rescore top N documents using a cosineSimilarity function for the dense_vector field type
3.) Combine the original score from the complex query with the cosine similarity according to the function: Math.log10(original_score) * cosineSimilarity_score.
I have been able to get the desired effect in two ways, but both are inefficient.
The first is to duplicate the entire original query in the query portion of the function score query used for rescoring. This seems pretty inefficient, especially when the number of postings lookups is high.
The second is to wrap the original query in a function score query that uses a script score to apply the combination to every matched document, which requires iterating through all the results rather than just the top N.
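For reference, this is roughly what that second attempt looks like. The bool query is just a stand-in for my real complex query, my_index / my_vector / query_vector are placeholder names, and depending on the Elasticsearch version the vector field may need to be passed as doc['my_vector'] rather than 'my_vector'. I'm showing the script_score query form here, which is where cosineSimilarity is available; the function_score variant is analogous:

    GET my_index/_search
    {
      "query": {
        "script_score": {
          "query": {
            "bool": {
              "should": [
                { "match": { "title": "some query text" } },
                { "match": { "body": "some query text" } }
              ]
            }
          },
          "script": {
            "source": "Math.log10(_score) * cosineSimilarity(params.query_vector, 'my_vector')",
            "params": { "query_vector": [0.12, 0.34, 0.56] }
          }
        }
      }
    }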
Are there better alternatives that I haven't considered?
Hello,
Indeed, it looks like we need another type of rescorer that combines scores from the original query and a rescore query in a more flexible way than just a linear combination.
For now, I can only suggest something that you have already tried (your first method). You were saying that it is inefficient to duplicate the original query in the rescore, but consider that the original query will only be run on a small number of documents (window_size), since during rescoring we can efficiently position directly on those documents.
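A sketch of what that could look like, with the bool query standing in for your original complex query and my_index / my_vector / query_vector / window_size as placeholder names and values:

    GET my_index/_search
    {
      "query": {
        "bool": {
          "should": [
            { "match": { "title": "some query text" } },
            { "match": { "body": "some query text" } }
          ]
        }
      },
      "rescore": {
        "window_size": 50,
        "query": {
          "rescore_query": {
            "script_score": {
              "query": {
                "bool": {
                  "should": [
                    { "match": { "title": "some query text" } },
                    { "match": { "body": "some query text" } }
                  ]
                }
              },
              "script": {
                "source": "Math.log10(_score) * cosineSimilarity(params.query_vector, 'my_vector')",
                "params": { "query_vector": [0.12, 0.34, 0.56] }
              }
            }
          },
          "query_weight": 0,
          "rescore_query_weight": 1
        }
      }
    }

With query_weight set to 0 and the default score_mode of total, the final score for the top window_size documents is just the combined script score; documents outside the window keep their original scores.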
The second method looks less efficient to me, as it needs to calculate this script score for every matched document, and we can't use any of the optimizations that skip non-competitive docs.
For the method involving duplicating the query, if I’m understanding correctly, we expect it to be more efficient because the rescore windowed documents will be applied as filter criteria. Though ES still needs to look up the full posting lists for all the query components of the original query and do a linear scan through the ids (and the postings are not on the order of the window size).
Two more questions that I have:
I believe you are working on incorporating approximate nearest neighbor search and I would like to follow updates on the issue. What’s the best way to track the status? And is there a current release target for that feature yet?
Also, I would find it useful to subtract score without using a script. I know Lucene doesn’t support negative scores, and it’s possible to decrease a score by using a <1 multiplier, but it’s harder to reason about the weights in some cases. Might it be possible to have an operation that could clip to zero in the case of producing a negative score? The use case is when you train a linear combination of feature weights that might have a negative or positive sign. It would also be good for discriminative terms.
We have created a new issue to introduce a rescorer based on a script.
if I’m understanding correctly, we expect it to be more efficient because the rescore windowed documents will be applied as filter criteria.
Indeed, we only go through the rescore windowed documents as filter criteria. But we still need to calculate scores for every document in this windowed set, which is a waste as we have already calculated them. We hope that the script rescore issue will address this.
Though ES still needs to look up the full posting lists for all the query components of the original query and do a linear scan through the ids (and the postings are not on the order of the window size).
Posting iterators can efficiently advance to a particular document using skip lists, without needing to do a linear scan. There are also optimizations that skip loading parts of the postings data that are not necessary.
approximate nearest neighbor search and I would like to follow updates on the issue.
We intend to create a public GitHub issue with our roadmap and plan soon.
Also, I would find it useful to subtract score without using a script. I know Lucene doesn’t support negative scores, and it’s possible to decrease a score by using a <1 multiplier, but it’s harder to reason about the weights in some cases. Might it be possible to have an operation that could clip to zero in the case of producing a negative score? The use case is when you train a linear combination of feature weights that might have a negative or positive sign. It would also be good for discriminative terms.
Indeed, no query in Lucene or Elasticsearch, not even a script_score, is allowed to produce negative scores. But you can use Math.max(0, _score) within a script_score to clip scores to 0.
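For example, in a script_score query, a trained linear combination with a negative weight could be clamped like this (feature1, feature2, and the weights are just placeholders):

    GET my_index/_search
    {
      "query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "Math.max(0, params.w1 * doc['feature1'].value + params.w2 * doc['feature2'].value)",
            "params": { "w1": 1.3, "w2": -0.7 }
          }
        }
      }
    }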