Dear all,
I'm observing a very strange behaviour in my Elasticsearch cluster. It has ~35 nodes and stores dozens of terabytes of data. We are running ES 5.5.3.
We recently recreated all of our warm nodes so that they use local SSDs instead of network-backed storage. After the migration we observed elevated latency (up to 5x), very slow garbage collections (multiple seconds), and higher CPU usage for a few days.
The process was as follows:
- We add 30 new nodes with SSDs to the cluster (effectively doubling its size)
- We exclude the old nodes from shard allocation (see the sketch after this list)
- ES moves all data
- We shut down the old nodes
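For reference, the exclusion step is just a transient cluster-settings update. Here is a minimal sketch of that kind of call; the endpoint and the node IPs are placeholders, not our real values:

```python
import requests

ES = "http://localhost:9200"                 # placeholder endpoint
OLD_NODE_IPS = "10.0.0.1,10.0.0.2,10.0.0.3"  # placeholder IPs of the old nodes

# Exclude the old nodes from shard allocation; ES then relocates their shards
# onto the remaining (new) nodes.
requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": OLD_NODE_IPS}},
).raise_for_status()

# Watch the drain: relocating_shards should fall to 0 once the move is done.
print(requests.get(ES + "/_cluster/health").json()["relocating_shards"])
```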
This afternoon, I added a new node to the cluster to run some tests, because I wanted to understand why the cluster was overall slower (with better disks!).
I didn't want any data on this new node, so I excluded it from allocation beforehand. After it was configured by our Ansible playbook, I stopped the ES service on it almost immediately.
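Excluding that single test node is the same kind of settings call, just keyed on the node name instead of the IP. A rough sketch with a made-up node name (the _cat call at the end is simply one way to confirm the node stays empty):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
NEW_NODE = "test-node-01"      # made-up name for the test node

# Exclude the test node by name before it joins, so it never receives shards.
requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": NEW_NODE}},
).raise_for_status()

# Once the node has joined, it should show up here with 0 shards.
print(requests.get(ES + "/_cat/allocation?v").text)
```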
A few seconds later, without any other action on my side, all our metrics came back to their previous baseline.
Here you can see the latency falling:
And here you can see the CPU usage going down; the purple line at the bottom is the CPU usage of the new node. The moment the average usage falls matches almost exactly the moment the node was configured to join the cluster:
So basically, adding a new node to our cluster for less than a minute fixed a cluster-wide issue that had been puzzling us for days.
I'm glad it's fixed, but I'm also extremely curious about what might have happened... Any ideas?
Thanks!