We're running a cluster with the following configuration:
6 data nodes (each has 31G RAM allocated to ES heap, and 16 vCPU - m4.4xlarge on AWS)
1 main index (others are very small)
21 shards
1 replica
~600,000,000 documents
800G of data
We are running thousands of queries per second and continuously see high CPU usage and load (almost 100%), which causes an EsRejectedExecutionException every few minutes.
We are using ES v2.3.4.
What would you recommend in this case (without changing the ES version or increasing the number of indices)?
Would increasing the CPU or the number of nodes be helpful?
I would recommend checking the search slowlog to see which queries are taking a long time. From there, see if you can optimize those queries so that they execute faster and hold a search thread for less time, which lets the queue drain more quickly.
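As a rough sketch of how the slowlog could be switched on: the thresholds are dynamic index settings, so they can be sent to the index `_settings` endpoint without a restart. The host, the index name `main_index`, and the threshold values below are placeholders I made up for illustration, not values from this cluster, so pick thresholds that match what "slow" means for your queries.

```python
import json
import urllib.request


def build_slowlog_settings(query_warn="5s", query_info="1s", fetch_warn="1s"):
    """Return search slowlog thresholds as a flat settings dict.

    The threshold values are illustrative assumptions, not recommendations.
    """
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }


def build_settings_request(host, index, settings):
    """Build (but do not send) a PUT request to <host>/<index>/_settings."""
    url = "{}/{}/_settings".format(host.rstrip("/"), index)
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # "main_index" and the localhost URL are hypothetical placeholders.
    req = build_settings_request(
        "http://localhost:9200", "main_index", build_slowlog_settings()
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the settings: urllib.request.urlopen(req)
```

Queries that cross a threshold should then show up in the search slowlog file on each data node, which is where to look for the expensive ones.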
Would increasing the CPU or the number of nodes be helpful?
This would probably help, though I can't tell for certain without knowing exactly why the queries are filling the queue. If you need a faster solution, adding resources will almost always relieve the pressure.
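To get a feel for why more nodes could help, here is a back-of-the-envelope sketch. It assumes the 2.x default search thread pool size of int((processors * 3) / 2) + 1, that every query hits all 21 shards, and that shard-level work spreads evenly across data nodes. The 2,000 queries per second figure is a made-up stand-in for "thousands"; none of these numbers are measurements from this cluster.

```python
def search_pool_size(processors):
    """Default search thread pool size as documented for ES 2.x (assumed here)."""
    return int((processors * 3) / 2) + 1


def shard_tasks_per_node(queries_per_sec, shards, data_nodes):
    """Shard-level search tasks per second per node, assuming an even spread."""
    return queries_per_sec * shards / data_nodes


def time_budget_ms(threads, tasks_per_sec):
    """Average time one shard-level task may take before the pool falls behind."""
    return threads / tasks_per_sec * 1000.0


if __name__ == "__main__":
    QPS = 2000  # hypothetical value standing in for "thousands of queries per second"
    SHARDS = 21
    threads = search_pool_size(16)  # 16 vCPU per node

    for nodes in (6, 12):
        tasks = shard_tasks_per_node(QPS, SHARDS, nodes)
        budget = time_budget_ms(threads, tasks)
        print("{} nodes: {:.0f} shard tasks/s per node, {:.2f} ms budget each".format(
            nodes, tasks, budget))
```

Under these assumptions, doubling the node count doubles the average time each shard-level task can take before requests start queueing, which is why adding nodes usually helps even when the slow queries themselves are unchanged.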
Another thing I would recommend is to upgrade if you can. We've made a lot of improvements since 2.3.4; one that might help you in particular is Adaptive Replica Selection, described in the Search APIs section of the Elasticsearch Guide [6.4].