Hi,
With Elasticsearch 2.2.1, we currently run a 3-node cluster in production, and its CPU usage reaches 80% for most of the day.
When the CPU usage is too high, our application regularly receives an error message like this one:
Elasticsearch::Transport::Transport::ServerError ([429] {"error":
{"root_cause":[
{"type":"es_rejected_execution_exception",
"reason":"rejected execution of org.elasticsearch.transport.TransportService$4@7be66fbe on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@7fce5848[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 2158260596]]"}, ...
The message indicates that all 7 search threads are busy and the 1000-slot search queue is full. The _cat/thread_pool API confirms that we have a lot of search.rejected occurrences:
$ curl 'es_node_1:9200/_cat/thread_pool?v'
host ip bulk.active bulk.queue bulk.rejected index.active index.queue index.rejected search.active search.queue search.rejected
10.160.79.148 10.160.79.148 0 0 0 0 0 0 7 306 11352252
10.160.79.146 10.160.79.146 0 0 0 0 0 0 7 311 12758498
10.160.79.147 10.160.79.147 0 0 0 1 0 0 7 109 14826492
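(For completeness: I assume the search queue size could be raised dynamically, since threadpool.search.queue_size still seems to be a dynamic setting in 2.x, but I suspect that would only hide the CPU problem rather than fix it, so it is not really what I am after:)
$ curl -XPUT 'es_node_1:9200/_cluster/settings' -d '{
  "transient": {
    "threadpool.search.queue_size": 2000
  }
}'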
We have one index per day, each with 3 primary shards and 3 replica shards (one replica per primary). Here is the size used by one index:
$ curl -s 'es_node_1:9200/_cat/shards' | grep "analytics-2017-03-07"
analytics-2017-03-07 2 r STARTED 185141 133.6mb 10.160.79.146 es_node_1
analytics-2017-03-07 2 p STARTED 185141 117.7mb 10.160.79.148 es_node_2
analytics-2017-03-07 1 p STARTED 185135 117.9mb 10.160.79.147 es_node_3
analytics-2017-03-07 1 r STARTED 185135 144.8mb 10.160.79.148 es_node_2
analytics-2017-03-07 0 p STARTED 188470 118.4mb 10.160.79.146 es_node_1
analytics-2017-03-07 0 r STARTED 188470 117.4mb 10.160.79.147 es_node_3
After further investigation, we found the slowest query (executed about once per second, with some variation in the attribute values). It runs against the documents of the last 6 months. Here is the query run manually (%2A is the URL-encoded * wildcard):
$ curl -s -XPOST 'es_node_1:9200/analytics-2016-09-2%2A,analytics-2016-09-30,analytics-2016-10-%2A,analytics-2016-11-%2A,analytics-2016-12-%2A,analytics-2017-01-%2A,analytics-2017-02-%2A,analytics-2017-03-0%2A/cdr/_search?ignore_unavailable=true' --data @/tmp/search.json
The content of search.json:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [{
            "range": {
              "header.started_at": {
                "gte": "2016-09-08T22:00:00.000Z",
                "lte": "2017-03-09T09:33:38.989Z"
              }
            }
          }, {
            "bool": {
              "should": [{
                "term": {
                  "header.registration_key": "4dbfc6c0-eead-0133-0530-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "f0a44c10-eeaf-0133-0531-0050569746d9"
                }
              }, {
                "term": {
                  "header.registration_key": "5dd52440-f00c-0133-347c-00505697369d"
                }
              }]
            }
          }],
          "should": [{
            "term": {
              "header.called_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }, {
            "term": {
              "header.coverage_entity.uuid": "35070630-60b6-0134-40cb-00505697111b"
            }
          }]
        }
      }
    }
  },
  "aggregations": {
    "per_interval": {
      "date_histogram": {
        "field": "header.started_at",
        "interval": "10000w",
        "extended_bounds": {
          "min": "2016-09-08T22:00:00.000Z",
          "max": "2017-03-09T09:33:38.989Z"
        }
      },
      "aggregations": {
        "unprocessed_calls_for_more_than_1_day": {
          "filter": {
            "bool": {
              "must": [{
                "term": {
                  "post_processing.pp_zone": 0
                }
              }, {
                "range": {
                  "header.ended_at": {
                    "lt": "2017-03-08T00:00:00.000+01:00"
                  }
                }
              }]
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        },
        "unread_voicemails": {
          "filter": {
            "range": {
              "post_processing.unread_voice_messages": {
                "gt": 0
              }
            }
          },
          "aggregations": {
            "result": {
              "stats": {
                "script": "1"
              }
            }
          }
        }
      }
    }
  }
}
And here is the (slow) response:
{
  "took": 1089,
  "timed_out": false,
  "_shards": {
    "total": 549,
    "successful": 549,
    "failed": 0
  },
  "hits": {
    "total": 1061,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "per_interval": {
      "buckets": [{
        "key_as_string": "1970-01-01T00:00:00.000Z",
        "key": 0,
        "doc_count": 1061,
        "unprocessed_calls_for_more_than_1_day": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        },
        "unread_voicemails": {
          "doc_count": 1,
          "result": {
            "count": 1,
            "min": 1.0,
            "max": 1.0,
            "avg": 1.0,
            "sum": 1.0
          }
        }
      }]
    }
  }
}
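If I read this correctly, with 3 primary shards per daily index, the 549 shards mean that 549 / 3 = 183 daily indices are searched on every single request, i.e. the whole 6-month range.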
Outside business hours, when CPU usage is lower, the same request takes "only" ~100 ms.
So, here is my question: how do you think we could reduce the CPU usage?
- Optimizing the query? (How? I sketched what I have in mind right after this list.)
- Reducing the number of shards? (Also sketched below, with an index template.)
- Adding a fourth node, or upgrading the nodes' RAM (currently 16 GB)?
- Upgrading Elasticsearch?
- Other?
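To make the first option more concrete, here is the kind of simplification I have in mind (just an untested sketch). Since the date_histogram with its 10000w interval only ever produces a single bucket, and since the stats aggregations on the constant script "1" merely repeat the doc_count of their parent filter (as the response above shows), I assume the whole aggregations section could be reduced to two plain filter aggregations, keeping the same query and "size": 0:
"aggregations": {
  "unprocessed_calls_for_more_than_1_day": {
    "filter": {
      "bool": {
        "must": [{
          "term": { "post_processing.pp_zone": 0 }
        }, {
          "range": { "header.ended_at": { "lt": "2017-03-08T00:00:00.000+01:00" } }
        }]
      }
    }
  },
  "unread_voicemails": {
    "filter": {
      "range": { "post_processing.unread_voice_messages": { "gt": 0 } }
    }
  }
}
The doc_count of each filter would then replace result.count / result.sum, and no script would have to run per document. Would that noticeably reduce the CPU load?
For the second option, since a daily index only weighs ~120-145 MB per shard, I suppose one primary shard per index would be enough. I imagine an index template along these lines (the template name and values are just an example) would apply to future daily indices:
$ curl -XPUT 'es_node_1:9200/_template/analytics' -d '{
  "template": "analytics-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'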
Thanks a lot for your time and suggestions.