Slow initialisation time after restart

We are seeing slow shard initialisation times. I went through this post:
http://elasticsearch-users.115913.n3.nabble.com/Restarting-an-active-node-without-needing-to-recover-all-data-remotely-td4039346.html#a4039355

It suggests that Elasticsearch was going to improve the slow restart process.

Has Elasticsearch made any fixes to improve it?

Which version of Elasticsearch are you on? How much data/indices/shards do you have in the cluster?

We are using Elasticsearch 2.4.2.

We have around 85 indices with 10 shards per index, and a total of 30TB of data.
We have:
3 master nodes
3 client nodes
18 data nodes (each with 3TB disk space and 64GB RAM, of which 32GB is allocated to ES)
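
For a rough sense of scale, the figures above can be turned into per-node numbers. This is a back-of-the-envelope sketch only: it assumes the 30TB is spread evenly across the 18 data nodes, and the replica count is not stated in the thread, so it is left as a parameter.

```python
# Figures quoted in the post above.
INDICES = 85
SHARDS_PER_INDEX = 10
TOTAL_DATA_TB = 30
DATA_NODES = 18


def cluster_estimate(replicas=0):
    """Rough per-node numbers; `replicas` is an assumption, not from the thread."""
    primaries = INDICES * SHARDS_PER_INDEX
    total_shards = primaries * (1 + replicas)
    return {
        "primary_shards": primaries,
        "total_shards": total_shards,
        "shards_per_node": total_shards / DATA_NODES,
        # Assumes data is balanced evenly across the data nodes.
        "data_per_node_tb": TOTAL_DATA_TB / DATA_NODES,
    }
```

Under these assumptions a single data node holds well over a terabyte, which is the amount that may have to be copied again in the worst case, when none of its local shard copies can be reused.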

If I follow the rolling restart process, with indexing disabled and a synced flush, recovery takes around 15 minutes.
However, if a node leaves the cluster and comes back, say due to a network issue or any other issue, recovery takes more than 3 hours (indexing is on).

I was monitoring the stats today and noticed that the initialisation of the shards alone took 3 hours, and no reallocation was done.

My question is: why does re-initialisation from the local node take more than 3 hours? Are there any settings we are missing?
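
For reference, the rolling restart procedure described above (disable shard allocation, synced flush, restart, re-enable allocation) can be sketched as the sequence of REST calls below. This is a minimal sketch that only builds the requests and sends nothing; the host `localhost:9200` and the helper names are placeholders, and stopping indexing is assumed to happen in the client application before the first call.

```python
import json


def rolling_restart_requests():
    """Return the (method, path, body) calls made around a single node restart."""
    disable = {"transient": {"cluster.routing.allocation.enable": "none"}}
    enable = {"transient": {"cluster.routing.allocation.enable": "all"}}
    return [
        # 1. Stop the master from reallocating shards while the node is down.
        ("PUT", "/_cluster/settings", disable),
        # 2. Synced flush: writes a sync_id marker to idle shard copies.
        ("POST", "/_flush/synced", None),
        # (restart the node here and wait for it to rejoin the cluster)
        # 3. Allow allocation again so local shard copies can be recovered.
        ("PUT", "/_cluster/settings", enable),
        # 4. Watch recovery progress.
        ("GET", "/_cat/recovery?v", None),
    ]


def as_curl(method, path, body=None, host="localhost:9200"):
    """Render one request as a curl command line (host is a placeholder)."""
    cmd = "curl -X{} 'http://{}{}'".format(method, host, path)
    if body is not None:
        cmd += " -d '{}'".format(json.dumps(body))
    return cmd


if __name__ == "__main__":
    for request in rolling_restart_requests():
        print(as_curl(*request))
```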

Are you actively indexing into all of these indices? Do you stop indexing when you perform a rolling restart?

During a rolling restart we stop indexing.
But in the case of a network failure, if a data node leaves the cluster, we don't stop indexing.

If you are indexing into all, or at least a large portion of the indices, synced flush will not help and the shards will need to be replaced, which probably explains the much longer recovery time.
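
To illustrate why active indexing defeats synced flush, here is a deliberately simplified toy model of the decision made when a shard copy comes back. It is not Elasticsearch's actual code; the function and dictionary keys are made up for illustration. It assumes that a matching sync_id lets recovery skip the file copy, and that otherwise every segment file that differs from the primary has to be copied.

```python
def files_to_copy(primary, replica):
    """Toy model: which primary files must be sent to a returning replica.

    Each argument is a dict with:
      "sync_id": marker written by a synced flush, or None if the shard
                 has been flushed again since (i.e. it was indexed into)
      "files":   mapping of segment file name -> checksum
    """
    same_marker = (
        primary["sync_id"] is not None
        and primary["sync_id"] == replica["sync_id"]
    )
    if same_marker:
        return []  # copies are known to be identical, nothing to transfer
    # No usable marker: fall back to comparing files one by one.
    return sorted(
        name
        for name, checksum in primary["files"].items()
        if replica["files"].get(name) != checksum
    )


# Idle index: both copies carry the same marker, so nothing is copied.
idle_primary = {"sync_id": "abc", "files": {"_0.cfs": 1, "_1.cfs": 2}}
idle_replica = {"sync_id": "abc", "files": {"_0.cfs": 1, "_1.cfs": 2}}

# Active index: the marker is gone and the copies have merged their
# segments independently, so the files no longer line up.
busy_primary = {"sync_id": None, "files": {"_7.cfs": 10, "_8.cfs": 11}}
busy_replica = {"sync_id": None, "files": {"_5.cfs": 20, "_6.cfs": 21}}
```

In this model `files_to_copy(idle_primary, idle_replica)` is empty, while the busy shard needs every primary file sent over, which matches the difference between the 15 minute and 3 hour recoveries described above.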

So, in the case of a network failure, with delayed allocation set to 5m:

If we detect (using some monitoring tools) that a data node has left the cluster and then stop indexing, which would be after the data node has already left, would that help recovery?

Basically, would stopping indexing after the node has left the cluster help recovery?
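
The 5m delayed allocation mentioned above corresponds to the `index.unassigned.node_left.delayed_timeout` index setting. A minimal sketch of the request that sets it on all indices is shown below; the helper name is made up, and the code only builds the request rather than sending it. Note that this setting only delays when replicas from the departed node are reassigned elsewhere. It does not by itself change how a returning shard copy is recovered.

```python
import json


def delayed_allocation_request(timeout="5m", index="_all"):
    """Build the settings update that delays replica reallocation.

    Returns (method, path, json_body). Nothing is sent over the network.
    """
    body = {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout,
        }
    }
    return "PUT", "/{}/_settings".format(index), json.dumps(body)
```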

I suspect the shards would still deviate, so am not sure that would help. If you had indices that you were not actively indexing into, those should recover faster. What type of data do you have in the cluster?

We have mutable data, and all indices are always active, as new documents from customers are indexed and old documents are updated frequently.

I suspected that may be the case, and am afraid I do not have any good suggestions. Maybe someone else in the community may have some suggestions?

Looking forward to more suggestions.
