Adding and removing a node fixed our latency issues

Dear all,

I'm observing some very strange behaviour in my Elasticsearch cluster. It has ~35 nodes and stores dozens of terabytes of data. We are running ES 5.5.3.

We recently recreated all of our warm nodes so that they use local SSDs instead of network-backed storage. After the migration we observed elevated latency (up to 5x), very slow garbage collections (multiple seconds) and higher CPU usage for a few days.

The process was as follows:

  • We add 30 new nodes with SSDs to the cluster (effectively doubling its size)
  • We exclude the old nodes from shard allocation
  • ES moves all data
  • We shut down the old nodes
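For readers who want to reproduce the exclusion step: it is usually done through the cluster settings API with the `cluster.routing.allocation.exclude._name` setting. The following is only a minimal sketch of that call, not necessarily how it was done here. The URL and the node names are assumptions for illustration, and the request is built but not sent.

```python
import json
import urllib.request

# Assumed values for illustration only; they are not from this cluster.
ES_URL = "http://localhost:9200"
OLD_NODES = ["warm-old-1", "warm-old-2"]


def build_exclude_body(node_names):
    """Return a cluster-settings body that excludes the named nodes from allocation."""
    return {
        "transient": {
            # Comma-separated list of node names to move shards away from.
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


def build_request(es_url, node_names):
    """Prepare (but do not send) the PUT /_cluster/settings request."""
    data = json.dumps(build_exclude_body(node_names)).encode("utf-8")
    return urllib.request.Request(
        url=es_url + "/_cluster/settings",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_request(ES_URL, OLD_NODES)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply the setting: urllib.request.urlopen(req)
```

A transient setting is used in this sketch because the exclusion is only needed until the old nodes are shut down; a persistent setting would survive a full cluster restart.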

This afternoon, I added a new node to the cluster to do some tests because I wanted to understand why the cluster was slower overall (with better disks!).
I didn't want to have data on this new node, so I excluded it from allocation beforehand. After it was configured by our Ansible playbook I stopped the ES service on it almost immediately.
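One way to confirm that an excluded node really holds no data is to count shards per node from the output of `GET _cat/shards?format=json`. This is a rough sketch under assumptions: the rows below are made-up sample data in the shape that API returns, and `test-node` is a hypothetical name for the new node.

```python
from collections import Counter


def shards_per_node(cat_shards_rows):
    """Count assigned shards per node; unassigned shards have no node and are skipped."""
    return Counter(row["node"] for row in cat_shards_rows if row.get("node"))


# Made-up sample rows, shaped like the JSON output of _cat/shards.
sample = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED", "node": "warm-1"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "STARTED", "node": "warm-2"},
    {"index": "logs-2", "shard": "0", "prirep": "p", "state": "UNASSIGNED", "node": None},
]

counts = shards_per_node(sample)
print(counts.get("test-node", 0))  # prints 0: no shards on the hypothetical new node
```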

A few seconds later, without any other action on my side, all our metrics came back to their previous baseline.

Here you can see the latency falling:
image
And here you can see the CPU usage going down, the purple line at the bottom is the CPU usage of the new node. The moment the average usage falls matches almost exactly the moment the node was configured to join the cluster:
image

So basically, adding a new node to our cluster for less than a minute fixed a cluster-wide issue that had been puzzling us for days.

I'm glad it's fixed, but also extremely curious about what might have happened... Any ideas?

Thanks!

That's very much past EOL and no longer supported; you need to upgrade as a matter of urgency.

Given its age, and the amount of work that has been done to the core of Elasticsearch since 5.x, I don't think you will be able to find an answer to this, unfortunately.

Thanks for your answer! Upgrading the cluster is definitely the very next step for us, but I was curious whether this was a known behavior or not...

Thanks again!
