Filter context and Build Scorers

We are running some rather large queries (thousands of terms) and they take dozens of seconds. Using the Profile API we can see that scorers are built, even though we are running the large queries in a filter context.

Can someone please advise why scorers are built despite queries being run in the filter context, and what can be done to speed up the query potentially containing thousands of terms / points?

Our query (the imei and mail fields are keyword fields, source_ip is IP data type):

{
  "size": "0",
  "query": {
    "bool": {
      "must_not": [
        {
          "terms": {
            "imei": [
            (100 terms)
            ]
          }
        },
        {
          "terms": {
            "mail": [
              (100 terms)
            ]
          }
        }
      ],
      "filter": {
        "terms": {
          "source_ip": [
            (42000 ips)
          ]
        }
      }
    }
  }
}

Result of the Profile API (cropped) - note the build scorer part of the source_ip query:

{
    "took": 52899,
    "hits": { 
        "total": 178603,
        "max_score": 0.0,
        "hits": []
    },
    "profile": {
        "shards": [
            {...},
            {
                "id": "[J3q3J7lqS4K1BVgeVcttUQ][xdr20181127][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "BooleanQuery",
                                "description": "-imei:(10036520012720226020065201601...",
                                "time_in_nanos": **37889706864**,
                                "breakdown": {
                                    "score": 0,
                                    "build_scorer_count": 276,
                                    "match_count": 21506,
                                    "create_weight": 8439,
                                    "next_doc": 26548682,
                                    "match": 17920605,
                                    "create_weight_count": 1,
                                    "next_doc_count": 21671,
                                    "score_count": 0,
                                    "build_scorer": **37845185684**,
                                    "advance": 0,
                                    "advance_count": 0
                                },
                                "children": [
                                    {...},
                                
                                    {
                                        "type": "PointInSetQuery",
                                        "description": "source_ip:{10.0.71.61 10.0.99.20...}",
                                        "time_in_nanos": **37372365796**,
                                        "breakdown": {
                                            "score": 0,
                                            "build_scorer_count": 414,
                                            "match_count": 0,
                                            "create_weight": 2205,
                                            "next_doc": 8767915,
                                            "match": 0,
                                            "create_weight_count": 1,
                                            "next_doc_count": 21671,
                                            "score_count": 0,
                                            "build_scorer": **37363573590**,
                                            "advance": 0,
                                            "advance_count": 0
                                        }
                                    }
                                ]
                            },

Scorers are an abstraction that exposes an iterator over the documents matching a query, in increasing order of doc ID, and that can optionally score these documents. All queries have an associated scorer; it doesn't matter whether they are executed as filters or not. The ip field builds a PointInSetQuery and the keyword field creates a terms query; both build a ConstantScoreScorer, which always returns 1 when a score is needed. The reason build_scorer is slow on these queries is that this phase is used to create a bitset of the documents that match the filter. These queries don't run a disjunction over all terms at once; there are too many. Instead they iterate over all terms sequentially and build the bitset incrementally.
The benefit of the ConstantScoreScorer is that the query is cacheable even if it is not in a filter context, because the score is the same for all matching documents.
Bottom line: what you see is expected; a query over 40k terms is slow. Is the set of IPs per query unique?
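To make the build_scorer cost concrete, here is a minimal sketch (plain Python, not actual Lucene code) of the behavior described above: each term's postings are walked one at a time and OR-ed into a single bitset, and the resulting scorer then iterates the set bits in increasing doc-ID order with a constant score of 1. All names here are illustrative.

```python
def build_constant_score_bitset(postings_by_term, max_doc):
    """Build the filter bitset term by term, mirroring why build_scorer
    dominates the profile: there is no single disjunction over all terms,
    just a sequential walk that sets bits incrementally."""
    bits = [False] * max_doc
    for term, postings in postings_by_term.items():
        for doc_id in postings:
            bits[doc_id] = True
    return bits

def constant_score_iterator(bits):
    """Expose matches in increasing doc-ID order, scoring every hit 1.0,
    like a ConstantScoreScorer."""
    for doc_id, is_set in enumerate(bits):
        if is_set:
            yield doc_id, 1.0

if __name__ == "__main__":
    # Two toy "IP terms" with overlapping postings lists.
    postings = {
        "10.0.71.61": [0, 5, 9],
        "10.0.99.20": [2, 5],
    }
    bits = build_constant_score_bitset(postings, max_doc=10)
    print(list(constant_score_iterator(bits)))
```

With 40k terms the per-term loop is exactly the part that grows, which is why the cost shows up in build_scorer rather than in next_doc or score.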

Hi Jim, thank you for your response.

I'm a colleague of Erez in this project.

It mostly is, I'm afraid. This leaves a few questions:

  1. Given the current LRU implementation of the query cache, is this going to spam the LRU queue and hurt caching for other repeating queries which would otherwise be cacheable?

  2. Are there any differences between TermsQuery and PointInSetQuery in this regard? E.g. would running a 40k-terms query on a keyword field be better than using an IP field in this case?

  3. Is there any way whatsoever to speed up such queries, e.g. force a disjunction, avoid creating the bitsets, or any other hack or trick, assuming the sets are not unique and probably hardly any subsets will be unique either?

I'm well aware of Lucene's limitations (I was around back in the days of the dreaded MaxClauseException), but this use case is valid and currently takes too long to run.

Thanks again!

No, big terms queries are not cacheable in 6.x, so the cache is not an issue.

Are there any differences between TermsQuery and PointInSetQuery in this regard? E.g. would running a 40k-terms query on a keyword field be better than using an IP field in this case?

I am not sure, but this is worth testing. A single exact-match term query is faster to execute on a keyword field, but 40k terms might not be.
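One way to test both execution paths against the same data would be a multi-field that indexes the value as both ip and keyword. This is only a sketch (the index name and the raw sub-field name are made up; existing documents would need to be reindexed to populate the sub-field):

```json
PUT my-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "source_ip": {
          "type": "ip",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

A terms filter on source_ip.raw would then exercise the keyword/terms path instead of PointInSetQuery, so the two can be profiled side by side.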

Is there any way whatsoever to speed up such queries, e.g. force a disjunction, avoid creating the bitsets, or any other hack or trick, assuming the sets are not unique and probably hardly any subsets will be unique either?

We don't optimize this case; querying 40k terms is expected to be slow, so we try to encourage users to change their design. Would it be possible to index groups rather than IPs and use a single identifier per group in the query? This solution might require some reindexing if your groups change dynamically, but the savings on the query side should offset that cost.
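To illustrate the suggestion: if each document carried a precomputed group identifier (the ip_group field and its value below are hypothetical), the 42k-term filter would collapse into a single term filter:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "term": { "ip_group": "watchlist-2018-11-27" }
      }
    }
  }
}
```

A single term lookup avoids the per-term bitset construction entirely, at the cost of maintaining the group assignment at index time.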

I'm afraid not. This is a two-phase query scenario: the list is generated from many scrolled results, and we can't group it or use CIDR blocks or anything like that.

Any other tricks we can use to speed up such heavy queries?

So I did some digging and thinking and found two possible avenues for speeding up such queries. Please feel free to correct me if I'm terribly off here.

  1. Setting "track_total_hits": false (which, from what I gather, is the MAXSCORE optimization: [LUCENE-4100] Maxscore - Efficient Scoring - ASF JIRA). As the docs state:

If you don't need to track the total number of hits you can improve query times
by setting this option to false. In such case the search can efficiently skip
non-competitive hits because it doesn't need to count all matches

For queries that don't specify a sort explicitly (and therefore sort by score), can the default behavior of iterating over documents matching a query in increasing order of doc ID be avoided? E.g. taking advantage of MAXSCORE and avoiding computing the total hits to speed up these ultra-large terms queries?

  2. We also experimented with terminate_after, which unsurprisingly didn't have any effect on this behavior. Wouldn't it be possible, and also make sense, to entirely avoid building the results bitset when terminate_after is specified, and instead iterate lazily over matching documents in an ultra-large terms disjunction?
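For reference, the two options discussed above are both top-level parameters of the search request body and combine like this (the terminate_after value is illustrative; as noted, neither avoids the bitset build for this query shape):

```json
{
  "size": 0,
  "track_total_hits": false,
  "terminate_after": 10000,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "source_ip": [
            "..."
          ]
        }
      }
    }
  }
}
```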

cc @jpountz

Unfortunately no, this is a requirement of the API. Lucene used to allow collecting documents out of order, but this was removed because it made things complex.

In my experience, yes: terms perform better than points for large terms queries. I don't think it's going to solve your problem entirely, but maybe make things a bit less slow.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.