Hi, I'm not looking for a definitive answer here, but I'm a bit puzzled as to what is going on and would appreciate some pointers on config settings to look at.
Running ES 6.3
10-node cluster: 5 warm, 5 hot
Indices:
Logstash indices, rotated every few hours; indexing rate ~2,500/s; 5 shards, 1 replica; search rate ~1-3 searches/second
One unrelated app index, ~300 MB total in size; 5 shards, 1 replica; indexing rate 0-1/second, search rate 1-2 searches/second
All requests (read/write) hit the hot nodes.
Every once in a while, the search queues (14 threads, 1000 queue size) on all hot nodes fill up and start rejecting requests. During this period, the Elasticsearch overall and per-node metrics show CPU spiking well over 100% on all nodes. Only after we stop indexing and/or reduce the number of searches does the cluster eventually settle down.
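For reference, this is roughly how we could watch the queues leading up to an episode: a minimal sketch (host, interval, and any auth are placeholders) that polls the per-node search thread pool via the `_cat/thread_pool` API so queue growth and rejections can be lined up against index rollover times.

```python
# Minimal sketch (host and interval are placeholders): poll the per-node
# search thread pool so queue growth/rejections can be correlated with
# the Logstash index rollover times seen in the logs.
import time
import requests

ES_URL = "http://localhost:9200"  # any node that still answers

while True:
    resp = requests.get(
        f"{ES_URL}/_cat/thread_pool/search",
        params={"h": "node_name,active,queue,rejected,completed", "format": "json"},
        timeout=10,
    )
    for row in resp.json():
        print(f"{time.strftime('%H:%M:%S')} {row['node_name']}: "
              f"active={row['active']} queue={row['queue']} rejected={row['rejected']}")
    time.sleep(30)
```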
There is nothing in the slow search logs on any node during this time period.
The only pattern we can see is that this tends to occur 5-10 minutes after Logstash stops writing to one index, creates a new one, and starts writing there, perhaps while one or more searches are concurrently in flight (impossible to tell), since we have zero insight into the search queue contents.
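On the "zero insight into the search queue contents" point: the tasks API at least lists the searches that are currently executing (i.e. the ones holding search threads, not the requests still sitting in the queue). A rough sketch, assuming the cluster still answers management calls during the stall:

```python
# Rough sketch: list in-flight search tasks and how long each has been running.
# This only shows tasks already holding a search thread, not queued requests.
import requests

ES_URL = "http://localhost:9200"  # placeholder

resp = requests.get(
    f"{ES_URL}/_tasks",
    params={"actions": "*search*", "detailed": "true"},
    timeout=10,
)
for node in resp.json().get("nodes", {}).values():
    for task in node.get("tasks", {}).values():
        running_ms = task.get("running_time_in_nanos", 0) / 1_000_000
        print(f"{node.get('name')} {task['action']} {running_ms:.0f}ms "
              f"{task.get('description', '')[:120]}")
```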
The same metrics do NOT show any increase in the search rate during this period, yet the search queues are maxed out.
This feels like some sort of cluster-wide blocking operation is preventing searches during this window, but we cannot figure out what that would be. None of the known searches typically take more than 500 ms to 3 s to return. During this same period, watchers fail with timeouts as well.
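If it really is a blocking operation, the nodes hot threads API is probably the most direct way to see what the search threads are stuck on while the queues are maxed out. A sketch of grabbing a few snapshots during a stall (snapshot count and spacing are arbitrary):

```python
# Sketch: capture a handful of hot-threads snapshots during a stall so the
# stack traces show what the search threads are actually doing or blocked on.
import time
import requests

ES_URL = "http://localhost:9200"  # placeholder
SNAPSHOTS = 3                     # arbitrary; a few spaced-out samples help

for _ in range(SNAPSHOTS):
    resp = requests.get(
        f"{ES_URL}/_nodes/hot_threads",
        params={"threads": 10, "type": "cpu"},
        timeout=30,
    )
    with open(f"hot_threads_{int(time.time())}.txt", "w") as fh:
        fh.write(resp.text)
    time.sleep(10)
```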
Not really sure where to begin looking. Any ideas appreciated.
For what it's worth, we are experiencing the same thing on a cluster upgraded from 5.6.x to 6.3.1. There were no issues on 5.6, but after a few days we start seeing the active search queue fill up and then reject subsequent queries. The health API responds, but all other APIs and general cluster responsiveness time out, requiring us to do a complete cluster restart.
We are also seeing this same behavior on a cluster that was recently upgraded to 6.4.2. The search queue grows and grows until it hits the queue capacity, and then the cluster locks up. The cluster responds to health checks but otherwise cannot be searched or indexed into. Only a full cluster restart fixes this.