Optimizing our Elastic Stack?

We're starting to see heavy CPU load on our stack and were hoping to get some help understanding how we can reduce it. Our current topology is as follows:

Hosted on AWS EC2
5 master/data nodes (m5.4xlarge)
2 coordinating nodes (m4.4xlarge)
Java heap set to less than 50% of the 64 GB of RAM
Total no. of documents/day: ~100,000
Total no. of documents overall: ~2 million
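(For reference, the heap is set in `config/jvm.options`; the usual guidance is to set `-Xms` and `-Xmx` to the same value, below 50% of RAM and below the ~32 GB compressed-oops threshold. A sketch with illustrative values for a 64 GB node:)

```
# config/jvm.options -- illustrative values for a 64 GB node
-Xms31g
-Xmx31g
```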

The cluster stats output is extremely large, so I've collated our stats on this page:
Cluster Stats

The node stats are listed here:
Node Stats

The links to your stats are broken.

Indexing 100k documents per day sounds fairly light. Are you searching them very heavily?

How many indices do you have, and how many shards?

We have about 40 indices (rotating monthly) and around 250 shards. That's just the thing: we did some analysis of our logs and audit trail to see what could be causing the spikes. The only thing that stood out was wildcard searches on our indices' free-form field.
For example, searching for "Susan" in a field called "conversation". But those are one-off spikes. We're seeing a continuous CPU load of 70-80% on the data nodes, which eventually brings down Elasticsearch (red indices state).
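(For context, a leading-wildcard query over that field would look roughly like the sketch below; the index name is assumed for illustration. Leading wildcards have to scan a large part of the terms dictionary, whereas a plain `match` query on an analyzed text field is a simple term lookup and is usually far cheaper:)

```
# Expensive: leading-wildcard query on the free-form field (index name assumed)
GET conversations-2019.06/_search
{
  "query": { "wildcard": { "conversation": "*susan*" } }
}

# Usually much cheaper: match query against the analyzed text field
GET conversations-2019.06/_search
{
  "query": { "match": { "conversation": "Susan" } }
}
```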

Cluster Stats:

Node Stats:

This should not be a resource-intensive search. One thing to try is to look at GET _nodes/hot_threads to see what the nodes are busy doing.
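For example (assuming the default host/port), the hot threads API also accepts a thread count and sampling interval:

```
GET _nodes/hot_threads?threads=5&interval=500ms
```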

I can't see any nodes reporting CPU load of 70-80% in the stats you've shared, although the website you've used to share your stats seems to have a very broken in-page search function so it's hard to be sure. Can you name a specific node that you think has such high CPU in those stats?
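One quick way to see per-node CPU at a glance is the cat nodes API, e.g.:

```
GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent
```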

Could you use something better-suited to sharing text documents like https://gist.github.com in future?

Normally when something "brings down Elastic" there are copious logs describing what went wrong. Have you looked at them? What do they say?

We don't have the current node stats, but when it hits those CPU levels again I'll post back here with both hot threads and node stats. I'm curious: does the master/data node structure we've chosen in the OP sound like an optimal setup?

It looks like you have ~40GB of data and ~5M documents on each node, so 200GB and 20M documents in total. You're indexing 100k documents per day, and it's not clear what your search load is. In the stats you shared most of the nodes are using less than 10% CPU. It's hard to say for sure without knowing a lot more about your needs, but my initial impression is that this cluster is oversized for its current workload.


Thanks, I'll keep that in perspective for our coming iteration. Another question: would it make sense, from a performance perspective, to separate our master/data nodes into 2 master and 5 data nodes (just to keep current BAU, even if oversized as you've indicated)?

I wouldn't recommend going below 3 master-eligible nodes, because you need at least three for fault tolerance. On busy or large clusters we recommend dedicated master nodes instead of mixed master/data nodes, but the threshold is not clearly defined. Really the only way to answer this kind of question is to perform some benchmarks of your own workload.
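(For what it's worth, node roles are set per node in `elasticsearch.yml`; on 6.x/7.x-era clusters the flags look roughly like this sketch, not a full config:)

```
# elasticsearch.yml on a dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# elasticsearch.yml on a dedicated data node
node.master: false
node.data: true
```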


So 3 x master nodes and 5 x data nodes, to keep in line with our current BAU?

I don't really understand what you're trying to achieve with this change.

Just to break out our existing 5-node master/data cluster into master-specific and data-specific nodes, without reducing the number of data nodes.

Your cluster seems oversized, so I would recommend going with 3 master/data nodes. Just because you CAN have dedicated node types does not mean you SHOULD.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.