Elasticsearch CPU spike, search thread pool queues explode

Hi, I'm not looking for a definitive answer here, but I'm a bit puzzled as to what is going on and would appreciate some pointers on config settings to look at.

Running ES 6.3

10 node cluster, 5 warm, 5 hot


  • Logstash indexes, rotated every few hours: indexing rate ~2500/s, 5 shards, 1 replica, search rate ~1-3 searches a second

  • one unrelated app index, ~300 MB total in size: 5 shards, 1 replica, indexing rate 0-1 a second, search rate 1-2 searches a second

All requests (read/write) hit hot nodes

Every once in a while, all search queues (14 threads, 1000 queue size) on all hot nodes fill up and start rejecting requests. During these periods, the overall and per-node Elasticsearch metrics show CPU spiking well over 100% on all nodes. Only after we stop indexing and/or reduce the number of searches does the cluster eventually settle down.
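For anyone trying to catch this as it happens, the per-node search pool numbers (active threads, queue depth, rejections) can be polled from the `_cat/thread_pool` API. The following is only a rough sketch: the `localhost:9200` address, the function names, and the 80% threshold are my own assumptions, and the queue size of 1000 is taken from the description above.

```python
import json
from urllib.request import urlopen

# Assumption: a node reachable at this address; adjust for your cluster.
ES_URL = "http://localhost:9200"


def fetch_search_pool(es_url=ES_URL):
    """Fetch per-node search thread pool stats from the _cat API as JSON."""
    url = (es_url + "/_cat/thread_pool/search"
           "?format=json&h=node_name,name,active,queue,rejected")
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def saturated_nodes(rows, queue_size=1000, threshold=0.8):
    """Return (node_name, active, queue, rejected) for every node whose
    search queue is at or above threshold * queue_size."""
    hits = []
    for row in rows:
        # The _cat APIs return numbers as strings when format=json is used.
        queue = int(row["queue"])
        if queue >= threshold * queue_size:
            hits.append((row["node_name"], int(row["active"]),
                         queue, int(row["rejected"])))
    return hits
```

Running `saturated_nodes(fetch_search_pool())` every few seconds and logging the result with a timestamp should make it possible to line up the moment the queues start growing with index rollover or other cluster events.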

There is nothing in the slow search logs on any node during this time period

The only thing we can see is that this tends to occur 5-10 minutes after Logstash stops writing to one Logstash index, creates a new one, and starts writing there, perhaps while one or more searches are in progress. That is impossible to tell, as we have zero insight into the search queue contents.
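On the lack of insight into what the searches are doing: the task management API (`GET _tasks`) can list search tasks with their descriptions and running times. Below is a minimal sketch of pulling out the longest-running ones. The host address and function names are assumptions of mine, and I am not certain the task list covers requests that are still waiting in the queue rather than executing, so treat it as a partial view.

```python
import json
from urllib.request import urlopen

# Assumption: a node reachable at this address; adjust for your cluster.
ES_URL = "http://localhost:9200"


def fetch_search_tasks(es_url=ES_URL):
    """Fetch currently registered search-related tasks, with descriptions."""
    url = es_url + "/_tasks?actions=*search*&detailed=true"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def longest_running(tasks_response, top=5):
    """Return the `top` longest-running tasks as
    (seconds, node_name, action, description), slowest first."""
    found = []
    # The default response groups tasks by node: nodes -> <id> -> tasks -> <id>.
    for node in tasks_response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            seconds = task["running_time_in_nanos"] / 1e9
            found.append((seconds, node.get("name", "?"),
                          task["action"], task.get("description", "")))
    found.sort(key=lambda item: item[0], reverse=True)
    return found[:top]
```

Calling `longest_running(fetch_search_tasks())` during a spike would at least show which indices and query bodies the stuck searches are targeting, which the slow log apparently does not.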

The same metrics do NOT show any increase in the search rate during this period, yet the search queues are maxed out.

This feels like some sort of blocking operation is preventing searches across the cluster during this period, but I cannot figure out what that would be. The known searches typically take between 500 ms and 3 s at most to return. During this same period, watchers fail with timeouts as well.

Not really sure where to begin looking. Any ideas appreciated.

How many indices and shards do you have in the cluster? What is the average shard size? How many of these are on the hot nodes?

All of the above indices have one shard on each of the nodes described. There are many Logstash indices, but only one is actively written to.

That is not what I asked. What is the output of the cluster health API?

6.1TB total data
231 indexes
about 195 per hot node

"cluster_name" : "jiblet8",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 13,
"number_of_data_nodes" : 10,
"active_primary_shards" : 955,
"active_shards" : 1940,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0

How many of these 1940 shards are on the hot nodes?

~193 per node
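The average shard size asked about earlier can be roughly estimated from the figures already posted. This is a back-of-the-envelope sketch only: it assumes the 6.1 TB total includes replicas and that 1 TB = 1000 GB, neither of which is stated in the thread.

```python
# Figures quoted in this thread.
total_data_tb = 6.1        # "6.1TB total data"
active_shards = 1940       # active_shards from the cluster health output
data_nodes = 10            # number_of_data_nodes
hot_nodes = 5              # 5 hot, 5 warm
shards_per_hot_node = 193  # "~193 per node"

# Assumption: 6.1 TB includes replica shards, and 1 TB = 1000 GB.
avg_shard_gb = total_data_tb * 1000 / active_shards
avg_shards_per_data_node = active_shards / data_nodes
shards_on_hot = shards_per_hot_node * hot_nodes
hot_share = shards_on_hot / active_shards

print(f"average shard size: ~{avg_shard_gb:.1f} GB")
print(f"average shards per data node: {avg_shards_per_data_node:.0f}")
print(f"shards on hot nodes: ~{shards_on_hot} ({hot_share:.0%} of all shards)")
```

Under those assumptions the shards average around 3 GB each, and roughly half of all shards sit on the hot nodes.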

Any thoughts?

For what it's worth, we are experiencing the same thing on a cluster upgraded from 5.6.x to 6.3.1. We had no issues in 5.6, but after a few days on 6.3.1 we see the active search queue fill up and then start rejecting subsequent queries. The health API responds, but all other APIs time out and the cluster becomes unresponsive, requiring us to do a complete cluster restart.

We are also seeing this same behavior on a cluster that was recently upgraded to 6.4.2. The search queue grows and grows until it hits the queue capacity, and then the cluster locks up. The cluster responds to health checks but otherwise can be neither searched nor indexed into. Only a full cluster restart fixes this.
