We're starting to see heavy CPU load on our stack and we're hoping to get some help in understanding how we can reduce it. Our current topology is as follows:
Hosted on AWS EC2
5 master/data nodes (m5.4xlarge)
2 coordinating nodes (m4.4xlarge)
Java heap set to less than 50% of the 64GB RAM (see the jvm.options sketch below)
Total no. of documents per day: ~100,000
Total no. of documents overall: ~2 million
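For reference, the heap is pinned via jvm.options along these lines (the exact value shown here is just an example, not our actual setting):

```
# jvm.options - heap pinned to a single value, kept under 50% of the 64GB RAM
-Xms30g
-Xmx30g
```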
The cluster stats output is extremely large, so I've collated our stats on this page: Cluster Stats
We have about 40 indices (rotating monthly) and around 250 shards. That's just the thing: we did some analysis of our logs and audit trail to see what could be causing the spikes. The only thing that stood out was wildcard search on a free-form field in our indices.
For example, searching for "Susan" in a field called "conversation". But those are one-off spikes. We're seeing a continuous CPU load of 70-80% on the data nodes, which eventually brings down Elasticsearch (red indices state).
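The offending search is roughly of this shape (the index name here is hypothetical and the query is simplified):

```
# index name is hypothetical; the real query has more clauses
GET conversations-2019.10/_search
{
  "query": {
    "wildcard": {
      "conversation": {
        "value": "*susan*"
      }
    }
  }
}
```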
This should not be a resource-intensive search. One thing to try is to look at GET _nodes/hot_threads to see what the nodes are busy doing.
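Something like the following captures a few samples of the busiest threads on every node (the parameters shown are just a starting point):

```
# 3 snapshots of the 5 hottest threads per node, sampled at 500ms intervals
GET _nodes/hot_threads?threads=5&interval=500ms&snapshots=3
```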
I can't see any nodes reporting CPU load of 70-80% in the stats you've shared, although the website you've used to share your stats seems to have a very broken in-page search function so it's hard to be sure. Can you name a specific node that you think has such high CPU in those stats?
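As a quick first check, something like this lists the nodes sorted by CPU so you can see which one is actually busy:

```
# verbose output, sorted by CPU, showing role, load and heap usage per node
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent&s=cpu:desc
```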
Could you use something better-suited to sharing text documents like https://gist.github.com in future?
Normally when something "brings down Elastic" there are copious logs describing what went wrong. Have you looked at them? What do they say?
We don't have the current node stats, but when it hits those CPU levels again I'll post back here with both hot threads and node stats. I'm curious: does the master/data node structure we've described in the OP sound like an optimal setup?
It looks like you have ~40GB of data and ~5M documents on each node, so 200GB and 20M documents in total. You're indexing 100k documents per day, and it's not clear what your search load is. In the stats you shared most of the nodes are using less than 10% CPU. It's hard to say for sure without knowing a lot more about your needs, but my initial impression is that this cluster is oversized for its current workload.
Thanks, I'll keep that in mind for our next iteration. Another question: from a performance perspective, would it make sense to separate our master/data nodes into 2 dedicated master nodes and 5 data nodes (just to keep current BAU, even if oversized as you've indicated)?
I wouldn't recommend going below 3 master-eligible nodes, because you need at least three for fault tolerance. On busy or large clusters we recommend dedicated master nodes instead of mixed master/data nodes, but the threshold is not clearly defined. Really the only way to answer this kind of question is to perform some benchmarks of your own workload.
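For what it's worth, dedicated roles are just elasticsearch.yml settings; a minimal sketch (this is the pre-7.9 boolean style, newer versions use node.roles instead):

```
# elasticsearch.yml on a dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml on a data-only node
node.master: false
node.data: true
node.ingest: true
```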
Just to clarify: the idea would be to break out our existing 5 master/data nodes into separate master-specific and data-specific nodes, without reducing the number of data nodes.
Your cluster seems oversized, so I would recommend going with 3 master/data nodes. Just because you CAN have dedicated node types does not mean you SHOULD.