Hi, I'm not looking for a definitive answer here; I'm just a bit puzzled about what is going on and hoping for some pointers on which config settings to look at.
Running ES 6.3
10 node cluster, 5 warm, 5 hot
Logstash indices, rotated every few hours; indexing rate ~2500 docs/s, 5 shards, 1 replica. Search rate ~1-3 searches/s.
One unrelated app index, ~300 MB total, 5 shards, 1 replica. Indexing rate 0-1 docs/s, search rate 1-2 searches/s.
All requests (read/write) hit hot nodes
Every once in a while, the search queues (14 threads, 1000 queue size) on all hot nodes fill up and start rejecting requests. Looking at the overall and per-node Elasticsearch metrics during this period, we see CPU spiking well over 100% on all nodes. Only after we stop indexing and/or reduce the number of searches going on does the cluster eventually settle down.
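For context, this is roughly how we're watching the search thread pool while it's backed up (just the standard `_cat/thread_pool` API, hostname is a placeholder for one of our hot nodes):

```shell
# Sample active/queued/rejected counts for the search thread pool
# on every node, once a second (Ctrl-C to stop)
while true; do
  curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected'
  sleep 1
done
```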
There is nothing in the slow search logs on any node during this time period
The only pattern we can see is that this tends to occur 5-10 minutes after Logstash stops writing to one index, creates a new one, and starts writing there, perhaps while one or more searches are concurrently in flight (impossible to tell, as we have zero insight into the search queue contents).
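Is the task management API the right way to get that insight into in-flight searches? Something like:

```shell
# List currently executing search tasks across the cluster,
# with per-task detail (description includes the query source)
curl -s 'localhost:9200/_tasks?actions=*search*&detailed&pretty'
```

Even with that, correlating a snapshot of running tasks with what's sitting in the queue seems hard.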
The same metrics do NOT show any increase in the search rate during this period, yet the search queues are maxed out.
This feels like some sort of cluster-wide blocking operation is preventing searches from completing during this window, but we cannot figure out what that would be. None of the known searches typically takes more than 500 ms to 3 s to return. During the same period, watchers fail with timeouts as well.
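Next time it happens I plan to grab a hot-threads dump during the spike, assuming that's the right tool for spotting a blocking operation:

```shell
# Dump the busiest threads on every node while CPU is pegged;
# run a few times during the event to see what's actually executing
curl -s 'localhost:9200/_nodes/hot_threads?threads=10' > hot_threads_$(date +%s).txt
```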
Not really sure where to begin looking. Any ideas appreciated.