Accessing TopDocs from a LeafScoreFunction


(Ethan Roseman) #1

My name is Ethan Roseman, and I'm a developer at Basis Technology. We work on an Elasticsearch plugin that integrates our product, RNI (https://www.basistech.com/text-analytics/rosette/name-indexer/).

With our plugin, we define a custom ScoreFunction in which we have previously used the SearchContext to get at the TopDocs from the query. Because our score function is computationally expensive, we look at the raw Lucene scores in TopDocs.scoreDocs to check whether each document scored above a certain threshold value, and we skip rescoring documents that do not look promising.

Now that SearchContext.current is gone as of 5.1.1, we are looking for a new way to access the raw Lucene scores from ES in our ScoreFunction / LeafScoreFunction. Is the only way to do this through SearchContext, which is no longer accessible to us?

Thanks,
Ethan


(Ryan Ernst) #2

If you implement ScoreFunction.needsScores() to return true, then the second argument of LeafScoreFunction.score (subQueryScore) should have the original score for the query. For example, to implement a score function which multiplies the original score by 2:

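import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.elasticsearch.common.lucene.search.function.LeafScoreFunction;
import org.elasticsearch.common.lucene.search.function.ScoreFunction;

public class MyScoreFunction extends ScoreFunction {
  // Ask for the wrapped query's score to be computed and passed in.
  @Override
  public boolean needsScores() {
    return true;
  }

  @Override
  public LeafScoreFunction getLeafScoreFunction(LeafReaderContext ctx) throws IOException {
    return new LeafScoreFunction() {
      @Override
      public double score(int docId, float subQueryScore) throws IOException {
        // subQueryScore is the score of the query wrapped by function_score.
        return 2.0f * subQueryScore;
      }
      // Other abstract methods of LeafScoreFunction are omitted here.
    };
  }

  ...
}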

(Ethan Roseman) #3

Thanks for your prompt reply. Unfortunately, the subQueryScore seems to always be 1.0. I can't find a single instance in which it is not. However, if we look through the ScoreDocs obtained by the SearchContext's TopDocs, we see a huge variety of scores, none of which are below 10, let alone an exact 1.0. What is the difference between the "subQueryScore" and the score in the actual ScoreDoc?

Edit: I'm realizing that the reason we're getting 1.0 for every document is that the "subquery" of the FunctionScoreQuery is just a MatchAllDocsQuery. In other words, we're not getting the score back from our original query at all. To help illustrate better, here's an example query:

{
    "explain": true,
    "query": {
        "match": {
            "primary_name": "Abdallah Ghmori"
        }
    },
    "rescore": {
        "window_size": 200,
        "query": {
            "rescore_query": {
                "function_score": {
                    "name_score": {
                        "field": "primary_name",
                        "query_name": "Abdallah Ghmori"
                    }
                }
            },
            "query_weight": 0.0
        }
    },
    "size": 1
}

Our desire is to be able to get the score back from the query before the "rescore" section, which we have previously been able to do by looking at TopDocs.scoreDocs.

Also, I believe I may have made a mistake in the phrasing of my question. While being able to see the original query score in the LeafScoreFunction would be helpful, we would still like to be able to see the TopDocs for all scored documents, not just a single one at a time. I can submit a new topic if preferred, since I realize I was asking for something a bit broader above.


(Ryan Ernst) #4

I think I see the problem. Your rescore query does not actually have any scoring query component. That is why you see 1.0 for the score (it is essentially a match-all query over the top 200 docs that matched the first query). I do not think there is a way to get the score from the first query; each rescore is a separate phase. You might try opening a feature request for this, at least so it can be discussed (it may be too complicated to be worthwhile).


(Ethan Roseman) #5

Thanks, I actually opened an issue yesterday, but it was closed after your first reply in this topic. I'm hoping our issue can be further discussed (and reopened) there: https://github.com/elastic/elasticsearch/issues/26105


(Igor Motov) #6

Ethan, as I understand your requirements, you need to limit rescore calls to those of the top 200 records that have a certain minimum score. Do you need to know the actual value of this score for any other purposes besides filtering?


(Ethan Roseman) #7

We have a function that computes a sensible threshold score after taking into account all the raw Lucene scores retrieved so far (which we have gotten via SearchContext.queryResult().topDocs().scoreDocs), so ideally we would like to be able to access this somehow. The subQueryScore argument of LeafScoreFunction.score would help us a little if the score we got was the one we were looking for, but it's not an ideal solution for us.
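
As a rough illustration of the thresholding described above, the sketch below derives a cutoff from a batch of first-pass scores and then checks individual documents against it. The class name, the method names, and the rule itself (a fixed fraction of the best score) are assumptions made up for this example; the actual threshold function used in the plugin is not described in this thread.

```java
/** Hypothetical sketch: derive a cutoff from first-pass scores, then gate expensive rescoring on it. */
public final class ThresholdSketch {

    /**
     * Assumed rule, for illustration only: keep documents that score at least
     * `fraction` of the best first-pass score.
     */
    public static float computeThreshold(float[] firstPassScores, float fraction) {
        if (firstPassScores.length == 0) {
            // Nothing to compare against, so let every document through.
            return Float.NEGATIVE_INFINITY;
        }
        float max = firstPassScores[0];
        for (float score : firstPassScores) {
            max = Math.max(max, score);
        }
        return max * fraction;
    }

    /** True if the document's first-pass score is high enough to justify the expensive scoring. */
    public static boolean shouldRescore(float firstPassScore, float threshold) {
        return firstPassScore >= threshold;
    }
}
```

Note that computing the cutoff needs all first-pass scores at once, while the check itself only needs the score of the current document.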

I didn't mention it in this post yet, but there are a few other things we look at in the SearchContext: rescore(), size(), numberOfShards(). Additionally, one of the places we have relied upon accessing these is within a MappedFieldType of ours. We use information about the rescore window size in creating our Lucene query for one of our custom field types.

This is why in the issue I've opened, I suggested that the SearchContext be accessible from the QueryShardContext, which I know is available to us in both the ScoreFunction and the MappedFieldType.


(Jimferenczi) #8

For the rescore problem you could copy the main query in the rescore query. That's a quick workaround but it would give you access to the original query score.
For a better fix we could give more flexibility to the rescore phase. Currently you can only change the query, but we could add a pre-filter phase where you could compute your threshold to build the rescore query?


(Ethan Roseman) #9

Thanks for your reply.

I considered putting the main query in the rescore query, but I'm actually not quite sure how I would do that programmatically. Or are you suggesting that our plugin users put their query in two places in their request, copying it into the "rescore" section?

The other potential qualm I have with this suggestion is that it would be running the query twice. We're very sensitive to performance changes, and running duplicates of all the queries doesn't seem very efficient.

Currently you can only change the query

Could you explain what you mean by this? Regardless, I believe more flexibility would be very helpful for us, but ideally we would want to be able to see all of the "original" query scores at once. Would this be possible in the change you're suggesting?


(Jimferenczi) #10

Yes, not ideal, but that's just a workaround for the current version :wink:

Could you explain what you mean by this? Regardless, I believe more flexibility would be very helpful for us, but ideally we would want to be able to see all of the “original” query scores at once. Would this be possible in the change you’re suggesting?

I didn't think about it much but it could be something like:

interface CustomQueryRescorer {
   Query rescoreQuery(QueryShardContext context, TopDocs topDocs);
}

... so that you can programmatically build the query based on the topDocs. This would keep things simple and give more flexibility to the rescorer by allowing it to change the query based on the first-phase result.
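
To illustrate how a threshold computed from the first-pass TopDocs might then reach the scoring code, here is a minimal, self-contained sketch. DocScorer is a hypothetical stand-in defined only for this example, not an Elasticsearch or Lucene type, and the names thresholded and skippedScore are likewise assumptions. The threshold is simply captured when the scorer is built.

```java
/** Hypothetical sketch: a scorer that captures a precomputed threshold when it is constructed. */
public final class ThresholdedScorerSketch {

    /** Stand-in for a per-document scoring callback; NOT an Elasticsearch or Lucene type. */
    public interface DocScorer {
        double score(int docId, float firstPassScore);
    }

    /**
     * Wraps an expensive scorer so that it only runs for documents whose
     * first-pass score reaches the threshold. Documents below the threshold
     * get `skippedScore` without invoking the expensive scorer.
     */
    public static DocScorer thresholded(float threshold, double skippedScore, DocScorer expensive) {
        return (docId, firstPassScore) ->
            firstPassScore >= threshold
                ? expensive.score(docId, firstPassScore)
                : skippedScore;
    }
}
```

In this sketch the threshold is fixed per scorer instance, so a new scorer would be built for each request once the first-pass results are known.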


(Ethan Roseman) #11

Ahh, I see. As long as there would be some way to pass our computed threshold down into our custom ScoreFunction / LeafScoreFunction, this seems like it would be very helpful. Would that be possible?


(Jimferenczi) #12

Would that be possible?

Yes, but that requires adding a custom rescorer that is able to programmatically create a query from the TopDocs. Currently you can only define the query that is run on the second pass, but with this change you could create a query that depends on the TopDocs returned in the first pass. It's just one idea to solve your problem, but I think we need to discuss more internally to see what could be done. Can you open a new issue with a feature request explaining what you want to achieve with your plugin? (Sorry, I know you already opened one, but this one would only describe the expected feature rather than proposing a solution.)


(Ethan Roseman) #13

I'm happy to write an issue, but I just want to make sure I'm doing it in the desired way.

If I am to break it down completely, there are three main things we would like to have access to at various places.

  1. TopDocs: to be able to compute a "minimum score threshold" based off of first pass scores so we can use this value in our ScoreFunction as a threshold. This would require a mechanism for getting the first-pass score of the current document to be able to compare it with our computed threshold. Is this wording better? Edit: I went ahead and created this issue. Hopefully this is better.

  2. The rescore window size for the query: We also use this to determine how many names to rescore. Currently this is being done in our ScoreFunction and in our MappedFieldType, as an argument in the way we generate the Lucene term query for our custom field type. We could use the previously suggested solution of just inserting this into our custom score function definition, but this is not ideal since it requires customers to put the same values in multiple places. Would this be a separate issue I can open on GitHub?

  3. The (not rescore) window size and the number of shards: used as well in our custom MappedFieldType in helping us generate our Lucene query for our custom field type. These are also values that could be handled the same way as rescore window size (manual copying), but it's not ideal.

Also, I have a question about SearchContext.rescore(); we've previously iterated through these and gotten the RescoreSearchContext with the greatest window size as our means of getting the rescore window size. Why can there be multiple RescoreSearchContexts here? What does each one represent?
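
For reference, the "largest window wins" selection described above could look roughly like the following. The list of integers stands in for the window sizes read from each rescore context; RescoreWindowSketch, largestWindow, and the fallback parameter are hypothetical names for this sketch, and no Elasticsearch types are used. (The search API accepts a list of rescorers in one request, which may be why more than one context can appear.)

```java
import java.util.List;

/** Hypothetical sketch: pick the largest rescore window among several rescore contexts. */
public final class RescoreWindowSketch {

    /**
     * Returns the largest window size in the list, or `fallback` when the
     * request has no rescore section at all (empty list).
     */
    public static int largestWindow(List<Integer> windowSizes, int fallback) {
        return windowSizes.stream()
            .mapToInt(Integer::intValue)
            .max()
            .orElse(fallback);
    }
}
```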

