Why does it take time for an Elasticsearch node to go "green" after being restarted?

(Shaunak Kashyap) #1

Elasticsearch (ES) keeps checksums for each shard so that, once a shard has been copied to a different node, it can verify that the copy is valid.

However, shard copies do diverge at the file level: once a shard copy has been started, each copy merges its segments independently. For shard relocation that is fine.

After a full restart, once ES and the primary shards have started, the checksums of the replica shards differ from those of the primary shards (for the reason explained above). Instead of reusing the existing replica shard, ES makes a fresh copy of the primary shard and uses that. This copying takes time, which is why getting from yellow state to green state can take a long time.
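To make the reasoning above concrete, here is a toy model in Python. It is not Elasticsearch's recovery code; the file names, checksum values, and the `files_to_copy` helper are all made up for illustration. It only shows why independently merged copies end up with almost nothing that can be reused.

```python
def files_to_copy(primary, replica):
    """Return the primary's files that the replica cannot reuse.

    A file is reusable only if the replica has a file with the same
    name and the same checksum. Hypothetical helper, not an ES API.
    """
    return sorted(
        name
        for name, checksum in primary.items()
        if replica.get(name) != checksum
    )


# Made-up segment files: each copy merged on its own schedule, so the
# two copies hold the same documents in differently named files.
primary = {"_a.cfs": "c1f0", "_b.cfs": "9e22", "segments_5": "77ab"}
replica = {"_c.cfs": "04d3", "segments_4": "51fe"}

print(files_to_copy(primary, replica))
```

In this sketch every file on the primary has to be copied, which corresponds to the full shard copy described above.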

We plan to improve this in a future release, so that ES can safely reuse the replica shard and no shard copying is needed for replica shards to reach the started state.

In the meantime, you can increase the following settings to speed this up:

indices.recovery.max_bytes_per_sec (defaults to 20mb/s)
indices.recovery.concurrent_streams (defaults to 3)

These settings define the limits at the node level.
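As a rough sketch, both settings could be raised at runtime through the cluster update settings API (`PUT /_cluster/settings`). The Python below uses only the standard library. The values `50mb` and `5`, the `http://localhost:9200` address, and the helper names are assumptions for illustration, not recommendations. Note also that, as far as I know, `indices.recovery.concurrent_streams` only exists in older ES versions, so check the documentation for your release before sending it.

```python
import json
import urllib.request


def build_recovery_settings(max_bytes_per_sec="50mb", concurrent_streams=5):
    """Build the request body for a transient cluster settings update.

    The default values here are illustrative only.
    """
    return {
        "transient": {
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
            "indices.recovery.concurrent_streams": concurrent_streams,
        }
    }


def apply_recovery_settings(base_url="http://localhost:9200", **kwargs):
    """Send the settings to the cluster (base_url is an assumed local node)."""
    body = json.dumps(build_recovery_settings(**kwargs)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the payload only; call apply_recovery_settings() to send it.
    print(json.dumps(build_recovery_settings(), indent=2))
```

Transient settings are lost on a full cluster restart, so for a permanent change the same keys would go in the node configuration or under `"persistent"` instead.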
