Performance issue on some requests

Hi,

We are running Elasticsearch 2.2.1 in production on a cluster of 3 nodes, and their CPU usage reaches 80% for most of the day.
When the CPU usage is too high, our application regularly gets an error message like this one:

Elasticsearch::Transport::Transport::ServerError ([429] {"error":
   {"root_cause":[
     {"type":"es_rejected_execution_exception",
      "reason":"rejected execution of org.elasticsearch.transport.TransportService$4@7be66fbe on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7fce5848[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 2158260596]]"}, ...

Using the _cat/thread_pool API, we can see a large number of search.rejected occurrences:

$ curl 'es_node_1:9200/_cat/thread_pool?v'
host          ip            bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected 
10.160.79.148 10.160.79.148           0          0             0            0           0              0             7          306        11352252 
10.160.79.146 10.160.79.146           0          0             0            0           0              0             7          311        12758498 
10.160.79.147 10.160.79.147           0          0             0            1           0              0             7          109        14826492 
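
To keep an eye on just the search columns, we can also restrict the output with the h parameter of the cat API (assuming the column names below match this version):

$ curl 'es_node_1:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected'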

We have one index per day, each with 3 primary shards and 3 replica shards (1 replica per primary), as shown below. Size used by one index:

$ curl -s 'es_node_1:9200/_cat/shards' | grep "analytics-2017-03-07"
analytics-2017-03-07  2 r STARTED 185141  133.6mb 10.160.79.146 es_node_1
analytics-2017-03-07  2 p STARTED 185141  117.7mb 10.160.79.148 es_node_2
analytics-2017-03-07  1 p STARTED 185135  117.9mb 10.160.79.147 es_node_3
analytics-2017-03-07  1 r STARTED 185135  144.8mb 10.160.79.148 es_node_2
analytics-2017-03-07  0 p STARTED 188470  118.4mb 10.160.79.146 es_node_1
analytics-2017-03-07  0 r STARTED 188470  117.4mb 10.160.79.147 es_node_3 

After further investigation, we found the slowest query (which is executed about once every second, with some variation in the attribute values). It runs on the documents from the last 6 months.

Here is the query run manually:

$ curl -s -XPOST 'es_node_1:9200/analytics-2016-09-2%2A,analytics-2016-09-30,analytics-2016-10-%2A,analytics-2016-11-%2A,analytics-2016-12-%2A,analytics-2017-01-%2A,analytics-2017-02-%2A,analytics-2017-03-0%2A/cdr/_search?ignore_unavailable=true' --data @/tmp/search.json

The content of search.json:

{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [{
            "range": {
              "header.started_at": {
                "gte": "2016-09-08T22:00:00.000Z",
                "lte": "2017-03-09T09:33:38.989Z"
              }
            }
          }, {
            "bool": {
              "should": [{
                "term": {
                  "header.registration_key": "4dbfc6c0-eead-0133-0530-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "f0a44c10-eeaf-0133-0531-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "5dd52440-f00c-0133-347c-00505697369d"
                }
              }]
            }
          }],
          "should": [{
            "term": {
              "header.called_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }, {
            "term": {
              "header.coverage_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }]
        }
      }
    }
  },
  "aggregations": {
    "per_interval": {
      "date_histogram": {
        "field": "header.started_at",
        "interval": "10000w",
        "extended_bounds": {
          "min": "2016-09-08T22:00:00.000Z",
          "max": "2017-03-09T09:33:38.989Z"
        }
      },
      "aggregations": {
        "unprocessed_calls_for_more_than_1_day": {
          "filter": {
            "bool": {
              "must": [{
                "term": {
                  "post_processing.pp_zone": 0
                }
              }, {
                "range": {
                  "header.ended_at": {
                    "lt": "2017-03-08T00:00:00.000+01:00"
                  }
                }
              }]
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        },
        "unread_voicemails": {
          "filter": {
            "range": {
              "post_processing.unread_voice_messages": {
                "gt": 0
              }
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        }
      }
    }
  }
}

And here is the (slow) response:

{
  "took": 1089,
  "timed_out": false,
  "_shards": {
    "total": 549,
    "successful": 549,
    "failed": 0
  },
  "hits": {
    "total": 1061,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "per_interval": {
      "buckets": [{
        "key_as_string": "1970-01-01T00:00:00.000Z",
        "key": 0,
        "doc_count": 1061,
        "unprocessed_calls_for_more_than_1_day": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        },
        "unread_voicemails": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        }
      }]
    }
  }
}
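
For scale, the 549 shards reported in that response are consistent with the roughly 183 daily indices covered by the 6-month window, times 3 primary shards per index.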

Outside business hours, when CPU usage is lower, the request takes "only" ~100 ms.

So, here is my question... How do you think we could reduce the CPU usage?

  • Optimizing the query? (How? A rough idea is sketched after this list.)
  • Reducing the number of shards?
  • Adding a fourth node or upgrading their RAM (currently 16 GB)?
  • Upgrading Elasticsearch?
  • Other?
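
Regarding the first point, here is a rough idea (a sketch only; it assumes our client code only needs the counts, so it can read each filter bucket's doc_count instead of result.count, and that the per_interval date_histogram is not really needed, since with a 10000-week interval it never produces more than one bucket). The "size" and "query" parts of search.json would stay unchanged; only the "aggregations" object would be replaced, dropping the two scripted stats aggregations:

"aggregations": {
  "unprocessed_calls_for_more_than_1_day": {
    "filter": {
      "bool": {
        "must": [
          { "term": { "post_processing.pp_zone": 0 } },
          { "range": { "header.ended_at": { "lt": "2017-03-08T00:00:00.000+01:00" } } }
        ]
      }
    }
  },
  "unread_voicemails": {
    "filter": {
      "range": { "post_processing.unread_voice_messages": { "gt": 0 } }
    }
  }
}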

Thanks a lot for your time and suggestions.

Those are quite small shards you have there, so when you query a long time period, a large number of them need to be searched. Given the size of those shards, you would probably be fine using weekly or even monthly indices instead of daily ones, which would mean fewer, larger shards per query and should be more efficient.

As I have pointed out in another post today, we do recommend that you perform a shard-sizing exercise, as described in this video, in order to identify the best shard size for your data and queries.
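
If you go monthly, for example, the main changes are the index name the application writes to (e.g. analytics-2017-03 instead of analytics-2017-03-07) and an index template so the new indices get fewer shards. A rough sketch (the template name and the shard/replica counts are placeholders, and the right numbers should come out of the shard-sizing exercise; any mappings from your current template would need to be carried over too):

$ curl -XPUT 'es_node_1:9200/_template/analytics' -d '
{
  "template": "analytics-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'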

Thanks a lot for your answer. We are going to do what you suggest.

Do not execute range queries/filters on millisecond-resolution fields; prefer second resolution if you want high performance.
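
One way to apply that to the query above would be to truncate header.started_at (and header.ended_at) to whole seconds when indexing, and to round the query bounds the same way, for example dropping the .989 milliseconds from the lte value. Rounding the bounds further, say to the minute, also makes successive runs of this once-per-second query use identical filters, which gives the filter cache a chance to help. A sketch of the rounded range filter, using the same values as above:

"range": {
  "header.started_at": {
    "gte": "2016-09-08T22:00:00Z",
    "lte": "2017-03-09T09:33:38Z"
  }
}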
