Cluster takes a long time to turn green

Hi Folks,

We are currently facing an issue in our Elasticsearch cluster.
During an upgrade/restart of the nodes in the cluster, it takes a long time for the cluster to go from yellow back to green.

I am following the steps below for the restart:

  1. Disable shard allocation
  2. Restart the node
  3. Enable shard allocation
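For reference, the steps above can be sketched as Console-style API calls (a sketch based on the cluster settings API; adjust values to your setup):

```
# 1. Disable shard allocation before stopping the node
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

# 2. Restart the node, then confirm it has rejoined the cluster
GET _cat/nodes?v

# 3. Re-enable shard allocation once the node is back
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}
```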

After following the above steps, it takes a very long time for the cluster to return to green from yellow.

ES version: 5.x

Please advise.

How large is the cluster? How much data is there in it? How many indices and shards do you have?

For the best recovery time you should follow the instructions for a rolling upgrade (except the upgrade bit obviously, but including all the optional steps).
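In particular, the optional synced-flush step helps a lot for indices that are not being written to during the restart, along these lines:

```
# Stop non-essential indexing, then perform a synced flush so that
# unchanged shards can be recovered quickly from local data
POST _flush/synced
```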

"5.x" is not a useful version number. It is always better to share the full version number. Things changed a lot between 5.0.0 and 5.6.16. One thing they all have in common is that they are long past the ends of their supported lives and you should definitely upgrade as soon as possible. In particular there are improvements to recovery speed in later versions.

@Christian_Dahlqvist The cluster is fairly large: 20 data nodes, each with 1.7 TB of total disk, of which ~500 GB is used.
Each node holds 20 shards.

@DavidTurner Current version is 5.1.1-1

I am not an expert, but with 20 nodes holding a lot of data it will take time for everything to reconnect. Also, ES takes roughly 20 seconds to start properly on a fresh cluster, so with a lot of data attached it may take a while longer.

What does the Kibana monitoring overview's shard activity show during the recovery period? Do all indices have at least one replica? Is forced awareness in use?
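Replica counts and per-index health can also be checked quickly with the cat APIs, for example (a sketch; pick whichever columns you need):

```
# One line per index: name, health, primary shard count, replica count
GET _cat/indices?v&h=index,health,pri,rep
```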

When the node is stopped, all replicas there are "lost", and for all shards that were primaries, a replica will be promoted to primary. When the node is restarted, recovery will start for the missing shards; I think this is where newer versions have changed how fast this happens. For indices that haven't changed during the outage interval, recovery on current versions is pretty fast. For indices with a lot of changes, recovery is slow. Unless you set recovery priorities on your indices, the order is fairly random. There are also limits on concurrent shard recoveries, I think the default is 2, so if 2 indices with a lot of changes start first, the rest will wait behind them.
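You can watch which shards are recovering and, cautiously, raise the concurrency limit, along these lines (a sketch; raising the limit trades recovery speed against indexing/search performance):

```
# Show per-shard recovery progress
GET _cat/recovery?v

# node_concurrent_recoveries defaults to 2
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}
```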

If you enable shard allocation too soon, will the cluster start doing recoveries to other nodes? I suspect it will.
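Related: the delayed-allocation setting exists precisely to stop the cluster from reassigning a departed node's shards to other nodes during a short planned restart. Something like the following (the 5-minute value is just an illustration):

```
# Delay reallocation of shards from a node that has left (default is 1m)
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
```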

"Lost" is putting it a bit strongly. The data remains on disk and will be re-used in a recovery if possible, saving a good deal of time, particularly if you follow the complete instructions for a rolling upgrade that I mentioned above.

Newer versions expand the scenarios under which Elasticsearch can re-use any existing data, and work continues on this front today, but even in versions as old as 5.1 most unchanged shards should recover pretty much instantly AFAIK.

Yea, my mind knew what I was trying to say, it just didn't make it to my fingers :slight_smile:
