Terms query slowing down at specific size

Hi,

While investigating ways to improve our ES queries, we noticed some odd query runtime behaviour and would like to ask for your insights.

Given the following mapping:

{
    "settings": { ... },
    "mappings": {
        "properties": {
            "ptIds": {
                "type": "keyword"
            }
        }
    }
}

We indexed 1 000 000 documents like:

{
    "ptIds": [1049, 1325, 812, ... ]
}

The arrays contain only numbers.
The numbers range from 1 to 2000.
The array length is between 1 and 150.
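
For anyone wanting to reproduce this, the documents can be indexed with the bulk API roughly like this (the index name and values are just examples):

POST /my-index/_bulk
{ "index": {} }
{ "ptIds": [ 1049, 1325, 812 ] }
{ "index": {} }
{ "ptIds": [ 6, 1183, 42 ] }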

We now queried our index with a terms query like:

{
    "size": 60,
    "query": {
        "terms": {
            "ptIds": [ ... ]
        }
    }
}

The following image shows how the query runtime behaves as the number of Ids in the query increases. Everything is fine up to 16 Ids, but as soon as we use a terms query with 17 Ids, the runtime makes a sudden jump.

[Image: query-runtime]

The image was created with Elasticsearch 7.7.1, running locally in Docker.
Our production cluster (v7.7.1, 26 nodes, index: shards 6 * 3, docs 921 590 140, size 201.74GB) shows the same characteristic.

Elasticsearch 7.15.1 shows the same characteristic and runtimes.
Elasticsearch 8.0.0-alpha2 shows the same characteristic but worse runtimes: a terms query with 70 Ids takes about 400ms.

Can someone please shed some light on this behaviour?
How can we improve this query?

Internally there are two strategies for executing terms queries - the second kicks in after 16 terms.

Before describing them it is worth pointing out the data structures both approaches access.
The index has:

  1. a term dictionary (sorted unique list of terms) and
  2. "postings lists" of sorted document IDs - one for each term.

When a query runs, it looks up each term in 1) to find the on-disk location of its postings list in 2).
When you have <=16 terms, Lucene reads the terms' postings lists in parallel, with some special optimisations. When collecting the 10 top-scoring docs, Lucene can avoid reading all document IDs in all postings lists if it knows none of them are going to contain sufficiently high-scoring docs. It is the boring, common terms, with many, many docs listed in their postings, that score low and also take a long time to traverse. Avoiding reading these is a big win when we know we've already collected the highest-scoring docs based on matches with rarer terms; low-scoring terms can be dropped.
The above is the strategy for low numbers of terms, but the challenge is that each postings list we traverse has a memory overhead for its read buffer. That doesn't scale when we have hundreds or thousands of unique terms. Instead, we read the postings lists serially, one after the other, and record each doc match as a bit in a single bitset. In this case all of the postings lists for all terms are read in their entirety (which is slow), but this keeps a fixed memory cost, which helps us avoid out-of-memory issues.
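
If you want to see how a query ended up being executed, the profile API shows the rewritten query structure - something like the following sketch (the index name is just a placeholder):

GET /my-index/_search
{
    "profile": true,
    "size": 60,
    "query": {
        "terms": {
            "ptIds": [ 1049, 1325, 812 ]
        }
    }
}

The Lucene query type reported in the profile output shows whether the terms query was executed as-is or rewritten into something else.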

The number 16 is hardcoded but you can use bool queries with should clauses to concoct queries of your own with multiple term or terms queries.

Hi @Mark_Harwood

thank you for your insights.

We hope we understood your hint correctly and came up with the following queries.
Unfortunately, none of them improves our situation.

Term query per Id

{
  "size": 60,
  "query": {
    "bool": {
      "should": [
        { "term": { "ptIds": 1482 } },
        { "term": { "ptIds": 981 } },
        // and so on
      ]
    }
  }
}

[Image: multiple term]

Multiple terms queries with up to 16 Ids

{
  "size": 60,
  "query": {
    "bool": {
      "should": [
        { "terms": { "ptIds": [ 1266, 1047, ... ] } }, // up to 16
        { "terms": { "ptIds": [ 6, 1183, ... ] } }, // up to 16
        // and so on
      ]
    }
  }
}

[Image: multiple terms1]

Since you mentioned scoring in your explanation, I would like to add that we don't need it in our case. We just need to filter the documents by the mentioned Ids.

Are there any other options we could try to improve our situation?

When benchmarking you have to take care to ensure things are not being cached, either at the Elasticsearch level or the file system level. Can we assume some Ids may be more popular than others? Also, I'm not sure how many docs are being retrieved - there is the fetch cost of reading the JSON of matching docs off disk. I see you have size: 60, so that might explain the plateau.
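
To rule out Elasticsearch-side caching between runs, you can clear the caches explicitly and disable the request cache per search - something like this sketch (the index name is a placeholder):

POST /my-index/_cache/clear

GET /my-index/_search?request_cache=false
{
    "size": 60,
    "query": {
        "terms": { "ptIds": [ 1049, 1325, 812 ] }
    }
}

The file system cache sits below Elasticsearch, so any warming effects there have to be controlled at the OS level.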

The graph for term queries looks reasonably linear. More search terms = more disk seeks = more time, so that is to be expected.
I'm not sure what is going on with the terms queries - they still look to have a 16 threshold despite your attempts to group them. I wonder if Lucene is somehow rewriting multiple terms sets into a combined one as part of query rewriting. That would be surprising. What happens if you use <16 Ids per terms query?

Hi Mark,

All our tests are run with ?request_cache=false. We just tried it without this parameter and the results were the same, so it seems to have no effect at all.
The other ES settings we touched are:

cluster.name: "elasticsearch"
network.host: 0.0.0.0
xpack.security.enabled: false
discovery.type: single-node

They should have no effect either.


Yes, some Ids are more popular than others. This graph shows how often each Id appears in our dataset:

[Image: Distribution]


Yes, the graph of the term queries was expected by us. We just wanted to show it for the sake of completeness.


This is the result with up to 10 Ids per terms query:
[Image: multiple-terms-10]

The query got slow, as we started to add the second terms query with the first Id (so 11 Ids in total) to the should clause.

Ok, so the crunch point looks to be related not to the hardcoded 16 limit inside the terms query, but to the point where your query can no longer be rewritten as a single terms query for execution and instead becomes a more complex BooleanQuery executing multiple terms queries. Clearly there is an overhead in managing multiple clauses. It is worth experimenting with nesting your bool query inside the filter context of a containing bool query, or wrapping it in a constant_score query, to remove the scoring aspects.
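
The filter-context variant would look something like this sketch (Ids elided as in your examples):

{
  "size": 60,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "should": [
            { "terms": { "ptIds": [ 1266, 1047, ... ] } },
            { "terms": { "ptIds": [ 6, 1183, ... ] } }
          ]
        }
      }
    }
  }
}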

Hi Mark,

thank you very much for your help.

Wrapping everything inside a constant_score query seems to be the solution.

{
  "size": 60,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "should": [
            { "terms": { "ptIds": [ 1266, 1047, ... ] } }, // up to 16
            { "terms": { "ptIds": [ 6, 1183, ... ] } }, // up to 16
            // and so on
          ]
        }
      }
    }
  }
}

This is the resulting graph in our test setup:

5-constant-score-split-terms

The crunch point is gone, and the response time for many Ids dropped from about 200ms to 20ms.


For the sake of completeness:
This is the profile of the query mentioned above:
ES - terms queries inside constant score (github.com)

This is the profile of the query not wrapped in constant_score:
ES - terms queries (github.com)

(Don't mind the time_in_nanos; my machine was under quite some load when creating them. I just want to show the difference in the structure.)
