ES rewriting range on @timestamp to BooleanQuery / TermQuery - Why?

Hi,

I'm running a very simple query that uses a range on the @timestamp field (type: date).
For some reason, the Profile API shows that it is rewritten into multiple TermQuery clauses on this field.
I was wondering why that happens, and whether it is supposed to be faster than a range query?

Query
{
  "profile": "true",
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": "1468829704652",
                  "lte": "1469434504652"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 500,
  "fields": [
    "*",
    "_source"
  ],
  "script_fields": {}
}

The Profile API output shows many entries like this:

  {
    "query_type": "BooleanQuery",
    "lucene": "@timestamp:0 \u0000\u0000\nX0o @timestamp:0 \u0000\u0000\nX0p @timestamp:0 \u0000\u0000\nX0q @timestamp:0 \u0000\u0000\nX0r @timestamp:0 \u0000\u0000\nX0s @timestamp:0 \u0000\u0000\nX0t @timestamp:0 \u0000\u0000\nX3,",
    "time": "0.5339960000ms",
    "breakdown": {
      "score": 0,
      "create_weight": 42867,
      "build_scorer": 33021,
      "match": 0,
      "advance": 0,
      "next_doc": 349037
    },

Any idea?

Good question! So this is due to some internal optimizations that Lucene makes. The summary can be found in the comment header of MultiTermQueryConstantScoreWrapper:

  This class also provides the functionality behind {@link MultiTermQuery#CONSTANT_SCORE_REWRITE}.
  It tries to rewrite per-segment as a boolean query that returns a constant score and otherwise
  fills a bit set with matches and builds a Scorer on top of this bit set.

Basically, the range is evaluated on each individual segment. If the segment only holds a small number of matching terms (16 or fewer), it rewrites the range into a boolean query of the individual terms. If the segment matches a larger number of terms, it generates a bitset and iterates over that as a "normal" range.

The reason comes down to speed: generating a bitset for all the documents in an index takes a certain amount of time. If there are not many terms to evaluate (which we can determine from the term dictionary for the segment), it's faster to skip the bitset generation and just check the terms individually with a boolean query.

But boolean queries slow down as more terms have to be evaluated, so at some point it makes sense to pay the cost of building the bitset: with many terms to check, the time is made back during the range evaluation.
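To make that decision concrete, here is a small, dependency-free sketch of the per-segment choice. The names here (Segment, rewriteRange, TERM_LIMIT) are made up for illustration; the real logic lives in Lucene's MultiTermQueryConstantScoreWrapper and works on the segment's term dictionary and doc-id sets rather than plain Java collections.

import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class RangeRewriteSketch {

    // Threshold described above: 16 or fewer matching terms -> boolean of term queries.
    static final int TERM_LIMIT = 16;

    // Minimal stand-in for a segment: each matching term maps to the doc ids that contain it.
    record Segment(Map<String, int[]> postingsByTerm, int maxDoc) {}

    static String rewriteRange(Segment segment, List<String> matchingTerms) {
        if (matchingTerms.size() <= TERM_LIMIT) {
            // Few terms: cheaper to rewrite into a boolean of individual term queries
            // and skip building a bitset entirely.
            return "BooleanQuery(" + String.join(" OR ", matchingTerms) + ")";
        }
        // Many terms: pay the one-time cost of a bitset, then iterate it like a "normal" range.
        BitSet matches = new BitSet(segment.maxDoc());
        for (String term : matchingTerms) {
            for (int doc : segment.postingsByTerm().getOrDefault(term, new int[0])) {
                matches.set(doc);
            }
        }
        return "BitSetIterator(" + matches.cardinality() + " matching docs)";
    }

    public static void main(String[] args) {
        Segment segment = new Segment(
            Map.of("1468829704652", new int[] {0, 3},
                   "1468829704653", new int[] {1}),
            10);
        // Only two matching terms in this segment -> rewritten as a boolean of term queries,
        // which is what the profile output above is showing.
        System.out.println(rewriteRange(segment, List.of("1468829704652", "1468829704653")));
    }
}

The important part is the branch on the number of matching terms: a handful of terms becomes a boolean of term queries, while many terms justify the up-front cost of the bitset.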

If you re-run your profile on an index where each segment matches many terms, you'll see the output change.

Also note: in 5.0+, the Lucene output is much friendlier. It won't spam a bunch of binary terms, but will instead show a simple [0 TO 10] style output :slight_smile:

Hm, I have the problem that this behavior causes an otherwise simple query to immediately overflow my search queue (capacity 1000), causing 4000 rejections and making the whole system unusable for a while.

Is there any way to disable this feature?

I think you're encountering a different, unrelated problem. The query expansion/rewrite process still occurs in a single search context, i.e. under a single thread. The process described above won't fill up your search queue.

The search queue is filling up due to multiple concurrent queries that are being executed, not because of one query that is "expanding" to multiple search contexts. I'd suggest opening a thread about your problem to get more help, since it's likely unrelated to this thread.