Dramatic performance degradation on range queries after upgrading ES from 2.4 to 7.1

Hi,

I have an Elasticsearch cluster of version 2.4 (Lucene version 5.5.4). Recently I installed a new cluster of version 7.1 (Lucene version 8.1.0) and indexed all the data into it (~2.7 billion documents). Both clusters use Java 1.8.

But I noticed a significant increase in latency for queries with range clauses:

...
{
   "range": {
      "metadata.sums": {
         "from": 1127039.0,
         "to": 1127040.0,
         "include_lower": true,
         "include_upper": false,
         "boost": 1.0
      }
   }
}
...

The old cluster executes such queries in ~304 ms, but on the new cluster the same query takes ~20 seconds.

I can't understand what the matter is. What has changed? How can I fix the problem?

Thanks

Could you share the mapping for that field?

Of course:

...
"metadata": {
    "properties": {
        ...,
        "sums": {
            "type": "double"
        }
    }
},
...

That should perform OK. Would it be possible to run the query using the profile API? Hopefully that will tell us what the query is doing.
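For reference, profiling is just a matter of adding "profile": true to the search body; a minimal sketch using the range from your first post would be:

{
   "profile": true,
   "query": {
      "range": {
         "metadata.sums": {
            "gte": 1127039.0,
            "lt": 1127040.0
         }
      }
   }
}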

Here is the heaviest part (in fact, there are 4 such parts):

{
   "type":"IndexOrDocValuesQuery",
   "description":"metadata.sums:[1128539.0 TO 1128539.9999999998]",
   "time_in_nanos":3903830947,
   "breakdown":{
      "set_min_competitive_score_count":0,
      "match_count":0,
      "shallow_advance_count":45,
      "set_min_competitive_score":0,
      "next_doc":0,
      "match":0,
      "next_doc_count":0,
      "score_count":0,
      "compute_max_score_count":45,
      "compute_max_score":6769,
      "advance":3903093539,
      "advance_count":15,
      "score":0,
      "build_scorer_count":75,
      "create_weight":6020,
      "shallow_advance":9449,
      "create_weight_count":1,
      "build_scorer":714989
   }
}

It takes 3.9 seconds.
I am wondering what advance means...

IndexOrDocValuesQuery is a type of query that decides at execution time whether the condition is resolved using the index or using doc values (brute force). Imagine you have a query with two filters, A and B. If A only matches a handful of documents, it might be expensive to resolve B using the index; instead we can decide to check those few documents directly using the existing doc values (brute-force approach).
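As a hypothetical sketch (the field name and value of the term filter are invented for illustration), filter A could be a selective term filter and filter B the range:

{
   "query": {
      "bool": {
         "filter": [
            { "term": { "organization": "acme" } },
            { "range": { "metadata.sums": { "gte": 1127039.0, "lt": 1127040.0 } } }
         ]
      }
   }
}

If the term filter only matches a few documents, it is cheaper to verify the range condition against just those documents via doc values than to iterate over everything that matches the range in the index.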

I am speculating on one idea, but I am afraid I need more information. Is the field metadata.sums actually holding arrays of doubles? In addition, what is the topology of the index: how many shards? How many documents per shard?

First, about the A and B filters: the number of such filters (should clauses) is not limited (although even with only one such condition, a query runs for 5+ seconds).

Then, about the configuration: there are 2.7 billion documents in each cluster.
Old cluster: 16 shards (from 121,815,021 to 398,345,655 docs per shard).
New cluster: 32 shards (from 40,839,388 to 178,946,503 docs per shard).

The field metadata.sums holds sums of money - yes, it is an array.

Here you can read what has changed in Elasticsearch regarding numeric fields:

I am a bit puzzled that even with only one range condition the query still takes 5+ seconds. Could you run it with only that condition and with the profiler enabled, to see where it is spending most of the time?

Hm...
If I execute a query like this:

{
   "size":100,
   "profile":true,
   "query":{
      "bool":{
         "should":[
            {
               "range":{
                  "metadata.sums":{
                     "from":1128539.0,
                     "to":1128540.0,
                     "include_lower":true,
                     "include_upper":false,
                     "boost":1.0
                  }
               }
            }
         ],
         "minimum_should_match":"1",
         "boost":1.0
      }
   }
}

then I get a response very quickly:

{
   "took":3,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "skipped":0,
      "failed":0
   },
   "hits":{
      "total":{
         "value":0,
         "relation":"eq"
      },
      "max_score":null,
      "hits":[

      ]
   },
   "profile":{
      "shards":[
         {
            "id":"[7k_HXKpPQBiijcQNPpiUCg][documents-v3][4]",
            "searches":[
               {
                  "query":[
                     {
                        "type":"IndexOrDocValuesQuery",
                        "description":"metadata.sums:[1128539.0 TO 1128539.9999999998]",
                        "time_in_nanos":1961299,
                        "breakdown":{
                           "set_min_competitive_score_count":0,
                           "match_count":0,
                           "shallow_advance_count":0,
                           "set_min_competitive_score":0,
                           "next_doc":0,
                           "match":0,
                           "next_doc_count":0,
                           "score_count":0,
                           "compute_max_score_count":0,
                           "compute_max_score":0,
                           "advance":6050,
                           "advance_count":41,
                           "score":0,
                           "build_scorer_count":84,
                           "create_weight":1187,
                           "shallow_advance":0,
                           "create_weight_count":1,
                           "build_scorer":1953936
                        }
                     }
                  ],
                  "rewrite_time":50759,
                  "collector":[
                     {
                        "name":"CancellableCollector",
                        "reason":"search_cancelled",
                        "time_in_nanos":18202,
                        "children":[
                           {
                              "name":"SimpleTopScoreDocCollector",
                              "reason":"search_top_hits",
                              "time_in_nanos":7412
                           }
                        ]
                     }
                  ]
               }
            ],
            "aggregations":[

            ]
         }
      ]
   }
}

IndexOrDocValuesQuery takes only 1.9 ms (the collector is reported as CancellableCollector, reason search_cancelled).

The interesting thing is that I have probably found the condition without which the query runs quickly.
The index contains documents from many organizations. Each organization has its own boxId parameter, and every document has a boxId field. The first condition of any query against the index is:

{
   "filter":[
      {
         "term":{
            "id.boxId":{
               "value":"{some boxId}",
               "boost":1.0
            }
         }
      },
      ...
   ]
}

If I remove this condition, the query is very fast.
But why does the old cluster work fine?

That actually makes sense, and I think it confirms what I was thinking. The problem is that newer versions of ES use this IndexOrDocValuesQuery from Lucene, which decides to execute the query either using the index or doc values, depending on some heuristics.

In your case, you are probably storing quite big arrays in the sums field, which throws off those heuristics, so we make a bad decision about how to execute the query.

One way to overcome this is to index that field without doc_values. The side effect is that you cannot use that field for sorting or aggregations.
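A minimal sketch of that mapping change (note that doc_values cannot be changed on an existing field, so this requires creating a new index and reindexing):

"metadata": {
    "properties": {
        "sums": {
            "type": "double",
            "doc_values": false
        }
    }
},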

Meanwhile I am opening an issue in Lucene to improve such heuristics.


FYI: https://issues.apache.org/jira/browse/LUCENE-8990

Wow!
I am not sure about "quite big arrays", but not using doc values on that field works for me, because I don't need to do sorting or aggregations on that field.

I'll try. I don't know how long it will take (I need to add storage). I've also subscribed to the ticket.

Thank you!

Please let us know the outcome. Thanks for reporting!

@einfoman I think that the bug that Ignacio found is real, but I'm also surprised it took so long to only advance the doc-value iterator 15 times. Are you on an especially busy/slow disk? If you still have that cluster and if that very slow response time is reproducible, I'd be curious if you could run the query in a tight loop and collect nodes hot threads concurrently to try to understand what the bottleneck of that query is.
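For reference, a hot-threads snapshot can be taken with:

GET /_nodes/hot_threads

polled repeatedly while the query runs.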

@Ignacio_Vera I reindexed the data without doc_values. It worked! Thank you for this workaround.


@jpountz The disks are SSDs. I wouldn't say they are busy.

The old index (with doc_values) is still alive. It is night now and the workload is very small. I executed the query and ran _nodes/hot_threads while it was executing. Here is the result (I got the same output every time during query execution): https://gist.github.com/einfoman/77081cca975f180c717e3ac1b7c4a239

I don't know if it is, though.

Thanks @einfoman, this answers my question. Your query is also suffering from the fact that disjunctions don't work well with two-phase iterators, which are created by phrase queries, script queries, and sometimes range queries. Ignacio's suggested workaround happens to address this issue as well.

Thanks

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.