We're starting to see heavy CPU load on our stack and we're hoping to get some help in understanding how we can reduce it. Our current topology is as follows:
Hosted on AWS EC2
5 master/data nodes (m5.4xlarge)
2 coordinating nodes (m4.4xlarge)
Java heap set to less than 50% of the 64GB RAM (see the jvm.options sketch below)
Total no. of documents per day: ~100,000
Total no. of documents overall: ~2 million
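For reference, the heap is pinned via jvm.options along these lines (the exact value shown here is just an example, not our actual setting):

```
# jvm.options - heap pinned to a single value, kept under 50% of the 64GB RAM
-Xms30g
-Xmx30g
```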
The cluster stats output is extremely large, so I've collated our stats on this page: Cluster Stats
We have about 40 indices (rotating monthly) and around 250 shards. That's just the thing: we did some analysis of our logs and audit trail to see what could be causing the spikes. The only thing that stood out was wildcard search on a free-form field in our indices.
For example, searching for "Susan" in a field called "conversation". But those are one-off spikes. We're seeing a continuous CPU load of 70-80% on the data nodes, which eventually brings down Elasticsearch (red indices state).
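The offending search is roughly of this shape (the index name here is hypothetical and the query is simplified):

```
# index name is hypothetical; the real query has more clauses
GET conversations-2019.10/_search
{
  "query": {
    "wildcard": {
      "conversation": {
        "value": "*susan*"
      }
    }
  }
}
```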
This should not be a resource-intensive search. One thing to try is to look at GET _nodes/hot_threads to see what the nodes are busy doing.
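Something like the following captures a few samples of the busiest threads on every node (the parameters shown are just a starting point):

```
# 3 snapshots of the 5 hottest threads per node, sampled at 500ms intervals
GET _nodes/hot_threads?threads=5&interval=500ms&snapshots=3
```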
I can't see any nodes reporting CPU load of 70-80% in the stats you've shared, although the website you've used to share your stats seems to have a very broken in-page search function so it's hard to be sure. Can you name a specific node that you think has such high CPU in those stats?
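As a quick first check, something like this lists the nodes sorted by CPU so you can see which one is actually busy:

```
# verbose output, sorted by CPU, showing role, load and heap usage per node
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent&s=cpu:desc
```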
Could you use something better-suited to sharing text documents like https://gist.github.com in future?
Normally when something "brings down Elastic" there are copious logs describing what went wrong. Have you looked at them? What do they say?
We don't have the current node stats, but when it hits those CPU levels again I'll post back here with both hot threads and node stats. I'm curious: does the master/data node structure we've described in the OP sound like an optimal setup?
It looks like you have ~40GB of data and ~5M documents on each node, so 200GB and 20M documents in total. You're indexing 100k documents per day, and it's not clear what your search load is. In the stats you shared most of the nodes are using less than 10% CPU. It's hard to say for sure without knowing a lot more about your needs, but my initial impression is that this cluster is oversized for its current workload.
Thanks, I'll keep that in mind for our next iteration. Another question: from a performance perspective, would it make sense to separate our master/data nodes into 2 dedicated master nodes and 5 data nodes (just to keep current BAU, even if oversized as you've indicated)?
I wouldn't recommend going below 3 master-eligible nodes, because you need at least three for fault tolerance. On busy or large clusters we recommend dedicated master nodes instead of mixed master/data nodes, but the threshold is not clearly defined. Really the only way to answer this kind of question is to perform some benchmarks of your own workload.
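For what it's worth, dedicated roles are just elasticsearch.yml settings; a minimal sketch (this is the pre-7.9 boolean style, newer versions use node.roles instead):

```
# elasticsearch.yml on a dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml on a data-only node
node.master: false
node.data: true
node.ingest: true
```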
Just to clarify: the idea would be to break out our existing 5 master/data nodes into separate master-specific and data-specific nodes, without reducing the number of data nodes.
Your cluster seems oversized, so I would recommend going with 3 master/data nodes. Just because you CAN have dedicated node types does not mean you SHOULD.