We have around 85 indices with 10 shards per index and a total of 30 TB of data.
We have
3 master nodes
3 client nodes
18 data nodes (each with 3 TB disk space and 64 GB RAM; 32 GB allocated to ES).
If I follow the rolling restart process with indexing disabled and a synced flush, recovery takes around 15 minutes.
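Roughly the steps I follow for the rolling restart; this is just a sketch, assuming a 6.x cluster where synced flush is still available, and localhost:9200 stands in for one of the client nodes:

```
# 1. Disable replica reallocation before stopping a node
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

# 2. Stop indexing, then issue a synced flush so shard copies can be reused on restart
curl -X POST "localhost:9200/_flush/synced"

# 3. Restart the node, wait for it to rejoin, then re-enable allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'
```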
However, if a node leaves the cluster and comes back (say due to a network issue or any other problem), recovery takes more than 3 hours (with indexing on).
I was monitoring the stats today and noticed that shard initialisation alone took 3 hours, and no reallocation was done.
My question is: why is re-initialisation from the local node taking more than 3 hours? Are there any settings we are missing?
If you are indexing into all, or at least a large portion of, the indices, synced flush will not help and the shards will need to be copied over in full, which probably explains the much longer recovery time.
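If you want to confirm that, the cat recovery API shows per-shard progress and whether files are being copied over the network rather than reused locally (a sketch; localhost stands in for one of your nodes):

```
# Show only in-flight recoveries, with stage, type and file/byte/translog progress
curl -X GET "localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,stage,type,files_percent,bytes_percent,translog_ops_percent"
```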
So, in case of a network failure, with delayed allocation set to 5m:
If we detect that a data node has left the cluster (using some monitoring tools) and then stop indexing (this would be after the data node has left the cluster), would that help recovery?
Basically, would stopping indexing after the node has left the cluster help recovery?
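For reference, this is how we set the delayed allocation; a sketch applied to all indices, with localhost standing in for a client node:

```
# Delay replica reallocation for 5 minutes after a node leaves,
# so a node that comes straight back can reuse its local shard copies
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}'
```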
I suspect the shards would still diverge, so I am not sure that would help. If you had indices that you were not actively indexing into, those should recover faster. What type of data do you have in the cluster?