Dear all,
I'm observing a very strange behaviour in my Elasticsearch cluster. It has ~35 nodes and stores dozens of terabytes of data. We are running ES 5.5.3.
We recently recreated all of our warm nodes so that they use local SSDs instead of network-backed storage. After the migration we observed elevated latency (up to 5x), very slow garbage collections (multiple seconds), and higher CPU usage for a few days.
The process was as follows:
- We add 30 new nodes with SSDs to the cluster (effectively doubling its size)
- We exclude the old nodes from shard allocation (see the sketch after this list)
- ES moves all data
- We shut down the old nodes
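For reference, the exclusion step is just a transient cluster-settings update. Here is a minimal sketch of that kind of call; the endpoint and the node IPs are placeholders, not our real values:

```python
import requests

ES = "http://localhost:9200"                 # placeholder endpoint
OLD_NODE_IPS = "10.0.0.1,10.0.0.2,10.0.0.3"  # placeholder IPs of the old nodes

# Exclude the old nodes from shard allocation; ES then relocates their shards
# onto the remaining (new) nodes.
requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._ip": OLD_NODE_IPS}},
).raise_for_status()

# Watch the drain: relocating_shards should fall to 0 once the move is done.
print(requests.get(ES + "/_cluster/health").json()["relocating_shards"])
```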
This afternoon, I added a new node to the cluster to run some tests, because I wanted to understand why the cluster was overall slower (with better disks!).
I didn't want any data on this new node, so I excluded it from allocation beforehand. After it was configured by our Ansible playbook, I stopped the ES service on it almost immediately.
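Excluding that single test node is the same kind of settings call, just keyed on the node name instead of the IP. A rough sketch with a made-up node name (the _cat call at the end is simply one way to confirm the node stays empty):

```python
import requests

ES = "http://localhost:9200"   # placeholder endpoint
NEW_NODE = "test-node-01"      # made-up name for the test node

# Exclude the test node by name before it joins, so it never receives shards.
requests.put(
    ES + "/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.exclude._name": NEW_NODE}},
).raise_for_status()

# Once the node has joined, it should show up here with 0 shards.
print(requests.get(ES + "/_cat/allocation?v").text)
```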
A few seconds later, without any other action on my side, all our metrics came back to their previous baseline.
Here you can see the latency falling:
And here you can see the CPU usage going down; the purple line at the bottom is the CPU usage of the new node. The moment the average usage falls matches almost exactly the moment the node was configured to join the cluster:
So basically, adding a new node to our cluster for less than a minute fixed a cluster-wide issue that had been puzzling us for days.
I'm glad it's fixed, but I'm also extremely curious about what might have happened... Any ideas?
Thanks!