Elasticsearch cpu spike, search thread pool queues explode

bitsofinfo · November 8, 2018, 7:43pm

Hi, I'm not looking for a definitive answer here, but I'm a bit puzzled as to what is going on and would appreciate some pointers on config settings to look at.

Running ES 6.3

10 node cluster, 5 warm, 5 hot

Indices:

Logstash indices, rotated every few hours: indexing rate ~2500/s, 5 shards, 1 replica, search rate ~1-3 searches per second
One unrelated app index, ~300 MB total in size: 5 shards, 1 replica, indexing rate 0-1 per second, search rate 1-2 searches per second

All requests (read/write) hit hot nodes

Every once in a while, the search queues (14 threads, queue size 1000) on all hot nodes fill up and start rejecting requests. During these periods, the overall and per-node Elasticsearch metrics show CPU spiking well over 100% on all nodes. Only after we stop indexing and/or reduce the number of searches does the cluster eventually settle down.
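For anyone wanting to watch this condition as it develops, the following is a minimal sketch (not something from the original setup) that polls the cat thread pool API for the `search` pool and flags nodes whose queue is close to capacity. The host `localhost:9200`, the 0.8 threshold, and the function names are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed endpoint; replace the host with one of your own nodes.
CAT_URL = (
    "http://localhost:9200/_cat/thread_pool/search"
    "?format=json&h=node_name,name,active,queue,queue_size,rejected"
)


def fetch_search_pool(url=CAT_URL):
    """Fetch the search thread pool rows (one dict per node) from the cat API."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def saturated_nodes(rows, threshold=0.8):
    """Return names of nodes whose search queue is at or above
    `threshold` of its configured capacity.

    The cat API returns numeric columns as strings, so they are
    converted with int() before comparing.
    """
    flagged = []
    for row in rows:
        queue = int(row["queue"])
        capacity = int(row["queue_size"])
        if capacity > 0 and queue / capacity >= threshold:
            flagged.append(row["node_name"])
    return flagged


if __name__ == "__main__":
    print(saturated_nodes(fetch_search_pool()))
```

Running this on a schedule and logging the result alongside the `rejected` counter would show whether the queues fill gradually or all at once after an index rollover.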

There is nothing in the slow search logs on any node during this time period

The only pattern we can see is that this tends to occur 5-10 minutes after logstash stops writing to one logstash index, creates a new one, and starts writing there, perhaps while one or more searches are concurrently in progress. That is impossible to tell, as we have zero insight into the contents of the search queue.
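One possible way to get some visibility here is the task management API, which lists tasks currently executing on each node. The sketch below is an assumption-laden illustration rather than a confirmed fix: the host, the `*search*` action filter, and the function names are hypothetical choices, and the API reports running tasks, so it may not expose everything waiting in a thread pool queue.

```python
import json
import urllib.request

# Assumed endpoint; `detailed=true` adds a description of each task.
TASKS_URL = "http://localhost:9200/_tasks?actions=*search*&detailed=true"


def fetch_tasks(url=TASKS_URL):
    """Fetch the tasks API response, grouped by node (the default)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_search_tasks(tasks_response):
    """Flatten the per-node task map into tuples of
    (node name, action, seconds running, description),
    sorted with the longest-running task first.
    """
    rows = []
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            rows.append(
                (
                    node.get("name"),
                    task["action"],
                    task["running_time_in_nanos"] / 1e9,
                    task.get("description", ""),
                )
            )
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


if __name__ == "__main__":
    for node, action, seconds, description in summarize_search_tasks(fetch_tasks()):
        print(f"{seconds:8.1f}s  {node}  {action}  {description}")
```

Capturing this output during a spike would at least show which search actions are in flight and how long they have been running.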

The same metrics do NOT show any increase in the search rate during this period, yet the search queues are maxed out.

It feels like some sort of blocking operation is preventing searches across the cluster during this period, but I cannot figure out what that would be. None of the known searches typically take more than 500 ms to return, 3 s at most. During the same period, watchers fail with timeouts as well.

Not really sure where to begin looking. Any ideas appreciated.

Christian_Dahlqvist · November 8, 2018, 7:49pm

How many indices and shards do you have in the cluster? What is the average shard size? How many of these are on the hot nodes?

bitsofinfo · November 8, 2018, 7:50pm

All of the above indices have one shard on each of the nodes described. There are many logstash indices, but only one is actively written to.

Christian_Dahlqvist · November 8, 2018, 7:52pm

That is not what I asked. What is the output of the cluster health API?

bitsofinfo · November 8, 2018, 7:55pm

6.1TB total data
231 indexes
about 195 per hot node

{
  "cluster_name" : "jiblet8",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 13,
  "number_of_data_nodes" : 10,
  "active_primary_shards" : 955,
  "active_shards" : 1940,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Christian_Dahlqvist · November 8, 2018, 8:02pm

How many of these 1940 shards are on the hot nodes?

bitsofinfo · November 8, 2018, 8:02pm

~193 per node
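As a rough cross-check of the figures quoted in this thread, the arithmetic can be sketched as below. The thread does not say whether the 6.1 TB total includes replicas, so both readings are computed; the function name and the decimal TB-to-GB conversion are assumptions for illustration.

```python
def shard_stats(active_shards, primary_shards, data_nodes, total_data_tb):
    """Derive average shard counts and sizes from cluster health figures.

    Returns a dict with:
      shards_per_node        - active shards divided evenly over data nodes
      avg_gb_if_all_shards   - average shard size if the total includes replicas
      avg_gb_if_primaries    - average shard size if the total counts primaries only
    Uses 1 TB = 1000 GB (an assumption; the thread does not specify).
    """
    total_gb = total_data_tb * 1000
    return {
        "shards_per_node": active_shards / data_nodes,
        "avg_gb_if_all_shards": total_gb / active_shards,
        "avg_gb_if_primaries": total_gb / primary_shards,
    }


# Figures taken from the cluster health output and the post above.
stats = shard_stats(
    active_shards=1940, primary_shards=955, data_nodes=10, total_data_tb=6.1
)
print(stats)
```

The even split works out to 194 shards per data node, which matches the reported ~193, and the average shard size lands at roughly 3 GB or 6 GB depending on which reading of the 6.1 TB figure applies.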

bitsofinfo · November 9, 2018, 3:19pm

Any thoughts?

haifidelity · November 13, 2018, 3:30am

For what it's worth, we are experiencing the same thing on a cluster upgraded from 5.6.x to 6.3.1. We had no issues in 5.6, but after a few days on 6.3.1 we started seeing the active search queue fill up and then reject subsequent queries. The health API responds, but all other APIs time out and the cluster becomes unresponsive, requiring a complete cluster restart.

wrinehart · November 14, 2018, 4:06pm

We are also seeing this same behavior on a cluster that was recently upgraded to 6.4.2. The search queue grows and grows until it hits the queue capacity, and then the cluster locks up. The cluster responds to health checks but otherwise can be neither searched nor indexed into. Only a full cluster restart fixes this.

system · December 12, 2018, 4:18pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
