Hi,
With Elasticsearch 2.2.1, we currently run a 3-node cluster in production, and its CPU usage reaches 80% for most of the day.
When the CPU usage is too high, our application regularly receives an error message like this one:
Elasticsearch::Transport::Transport::ServerError ([429] {"error":
{"root_cause":[
{"type":"es_rejected_execution_exception",
"reason":"rejected execution of org.elasticsearch.transport.TransportService$4@7be66fbe on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7fce5848[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 2158260596]]"}, ...
The message indicates that all 7 search threads are busy and the 1000-slot search queue is full. The _cat/thread_pool API confirms that we have a lot of search.rejected occurrences:
$ curl 'es_node_1:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
10.160.79.148 10.160.79.148 0 0 0 0 0 0 7 306 11352252
10.160.79.146 10.160.79.146 0 0 0 0 0 0 7 311 12758498
10.160.79.147 10.160.79.147 0 0 0 1 0 0 7 109 14826492
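(For completeness: I assume the search queue size could be raised dynamically, since threadpool.search.queue_size still seems to be a dynamic setting in 2.x, but I suspect that would only hide the CPU problem rather than fix it, so it is not really what I am after:)
$ curl -XPUT 'es_node_1:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.search.queue_size": 2000
  }
}'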
We have one index per day, each with 3 primary shards and 3 replica shards (one replica per primary). Here is the size used by one index:
$ curl -s 'es_node_1:9200/_cat/shards' | grep "analytics-2017-03-07"
analytics-2017-03-07 2 r STARTED 185141 133.6mb 10.160.79.146 es_node_1
analytics-2017-03-07 2 p STARTED 185141 117.7mb 10.160.79.148 es_node_2
analytics-2017-03-07 1 p STARTED 185135 117.9mb 10.160.79.147 es_node_3
analytics-2017-03-07 1 r STARTED 185135 144.8mb 10.160.79.148 es_node_2
analytics-2017-03-07 0 p STARTED 188470 118.4mb 10.160.79.146 es_node_1
analytics-2017-03-07 0 r STARTED 188470 117.4mb 10.160.79.147 es_node_3
After further investigation, we found the slowest query (executed about once per second, with some variation in the attribute values). It runs against the documents of the last 6 months. Here is the query run manually (%2A is the URL-encoded * wildcard):
$ curl -s -XPOST 'es_node_1:9200/analytics-2016-09-2%2A,analytics-2016-09-30,analytics-2016-10-%2A,analytics-2016-11-%2A,analytics-2016-12-%2A,analytics-2017-01-%2A,analytics-2017-02-%2A,analytics-2017-03-0%2A/cdr/_search?ignore_unavailable=true' --data @/tmp/search.json
The content of search.json:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [{
            "range": {
              "header.started_at": {
                "gte": "2016-09-08T22:00:00.000Z",
                "lte": "2017-03-09T09:33:38.989Z"
              }
            }
          }, {
            "bool": {
              "should": [{
                "term": {
                  "header.registration_key": "4dbfc6c0-eead-0133-0530-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "f0a44c10-eeaf-0133-0531-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "5dd52440-f00c-0133-347c-00505697369d"
                }
              }]
            }
          }],
          "should": [{
            "term": {
              "header.called_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }, {
            "term": {
              "header.coverage_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }]
        }
      }
    }
  },
  "aggregations": {
    "per_interval": {
      "date_histogram": {
        "field": "header.started_at",
        "interval": "10000w",
        "extended_bounds": {
          "min": "2016-09-08T22:00:00.000Z",
          "max": "2017-03-09T09:33:38.989Z"
        }
      },
      "aggregations": {
        "unprocessed_calls_for_more_than_1_day": {
          "filter": {
            "bool": {
              "must": [{
                "term": {
                  "post_processing.pp_zone": 0
                }
              }, {
                "range": {
                  "header.ended_at": {
                    "lt": "2017-03-08T00:00:00.000+01:00"
                  }
                }
              }]
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        },
        "unread_voicemails": {
          "filter": {
            "range": {
              "post_processing.unread_voice_messages": {
                "gt": 0
              }
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        }
      }
    }
  }
}
And here is the (slow) response:
{
  "took": 1089,
  "timed_out": false,
  "_shards": {
    "total": 549,
    "successful": 549,
    "failed": 0
  },
  "hits": {
    "total": 1061,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "per_interval": {
      "buckets": [{
        "key_as_string": "1970-01-01T00:00:00.000Z",
        "key": 0,
        "doc_count": 1061,
        "unprocessed_calls_for_more_than_1_day": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        },
        "unread_voicemails": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        }
      }]
    }
  }
}
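If I read this correctly, with 3 primary shards per daily index, the 549 shards mean that 549 / 3 = 183 daily indices are searched on every single request, i.e. the whole 6-month range.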
Outside business hours, when CPU usage is lower, the same request takes "only" ~100 ms.
So, here is my question: how do you think we could reduce the CPU usage?
- Optimizing the query? (How? I sketched what I have in mind right after this list.)
- Reducing the number of shards? (Also sketched below, with an index template.)
- Adding a fourth node, or upgrading the nodes' RAM (currently 16 GB)?
- Upgrading Elasticsearch?
- Other?
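To make the first option more concrete, here is the kind of simplification I have in mind (just an untested sketch). Since the date_histogram with its 10000w interval only ever produces a single bucket, and since the stats aggregations on the constant script "1" merely repeat the doc_count of their parent filter (as the response above shows), I assume the whole aggregations section could be reduced to two plain filter aggregations, keeping the same query and "size": 0:
"aggregations": {
  "unprocessed_calls_for_more_than_1_day": {
    "filter": {
      "bool": {
        "must": [{
          "term": { "post_processing.pp_zone": 0 }
        }, {
          "range": { "header.ended_at": { "lt": "2017-03-08T00:00:00.000+01:00" } }
        }]
      }
    }
  },
  "unread_voicemails": {
    "filter": {
      "range": { "post_processing.unread_voice_messages": { "gt": 0 } }
    }
  }
}
The doc_count of each filter would then replace result.count / result.sum, and no script would have to run per document. Would that noticeably reduce the CPU load?
For the second option, since a daily index only weighs ~120-145 MB per shard, I suppose one primary shard per index would be enough. I imagine an index template along these lines (the template name and values are just an example) would apply to future daily indices:
$ curl -XPUT 'es_node_1:9200/_template/analytics' -d '{
  "template": "analytics-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'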
Thanks a lot for your time and suggestions.