Our cluster consists of 4k indices and about 20k shards spread over 7 nodes. Upon restarting an Elasticsearch node, most of the indices (99.98%) become available within 5 minutes, but every time a few indices larger than 100GB get stuck in recovery for hours, sometimes up to 12 hours.
About 100 indices are read/write; the rest are read-only.
The write indices have 1 replica, but the read-only indices have 0 replicas.
I suspect that at least part of the reason is that you have far too many shards for a cluster that size. Please read this blog post on shards and sharding practices, and then work to reduce the shard count dramatically, e.g. by reindexing into fewer and larger indices.
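As a minimal sketch of what that consolidation can look like, here's the `_reindex` API driven from the official Python client. The host and the index names (`logs-2019.01.*`, `logs-2019.01`) are hypothetical placeholders, not anything from your cluster:

```python
# Sketch: consolidate many small daily indices into one larger monthly
# index via the _reindex API. Host and index names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# wait_for_completion=False returns a task ID so the copy runs in the
# background instead of holding the HTTP request open.
task = es.reindex(
    body={
        "source": {"index": "logs-2019.01.*"},
        "dest": {"index": "logs-2019.01"},
    },
    wait_for_completion=False,
)
print(task["task"])  # poll progress with es.tasks.get(task_id=...)

# Once the copy is verified, the old indices can be deleted:
# es.indices.delete(index="logs-2019.01.*")
```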
When a node restarts, its shard copies are removed from the cluster's count of copies. The cluster will wait a while for the node to come back, and when it does, the node will try to use the shard copies it has on disk as real shard copies. To do that it needs to make sure each copy has everything that happened while it was gone. It can do this by:
Syncing all of the files that hold the index from the primary to itself, reusing the files it already has on disk when they match. This isn't as terrible as it sounds, but it can be quite slow if there have been many writes since the last sync. This is almost certainly what is happening with your big indices. This was Elasticsearch's original recovery mechanism and is still used today when other mechanisms fail.
Relying on a "synced flush" that marks the state of an index, promising that all of the files on disk contain all of the operations that the primary has. This flush is applied automatically when an index hasn't been written to for a while, but it can also be triggered manually (see the sketch below). It should make indices that are effectively read-only recover almost instantly. But if there are any writes to the shard while the node is down, Elasticsearch has to fall back to copying files.
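Here's a minimal sketch of triggering that synced flush manually on your read-only indices before a planned restart, again with the official Python client. The host and the `readonly-*` pattern are assumptions; note also that synced flush was deprecated in 7.6 and removed in 8.0, where a plain flush does the same job:

```python
# Sketch: manually synced-flush read-only indices before a restart so
# they can recover from local files. Host and index pattern are
# hypothetical; applies to 6.x/7.x clusters.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# ignore=[409]: a 409 just means some shards had in-flight writes;
# those shards will fall back to file-based recovery.
result = es.indices.flush_synced(index="readonly-*", ignore=[409])
print(result["_shards"])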
It should be possible to replay just the changes that happened while the node was away, but I've not been paying much attention to how the implementation of that is going. It would allow faster recoveries even with changes, so long as not too many of them happened while the node was gone. You aren't getting that yet; the work to implement it has been slow to arrive.
Have a look at the instructions for doing a rolling restart. For the most part they apply even if you only want to restart a single node.
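For reference, here's a sketch of the settings dance those instructions describe, using the documented `cluster.routing.allocation.enable` setting and the `_cat/recovery` API. The host is hypothetical, and the actual node restart happens outside the script:

```python
# Sketch: single-node rolling restart. Disable replica allocation,
# restart the node out of band, re-enable, then watch recovery.
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Stop the cluster from reallocating replicas while the node is down.
es.cluster.put_settings(
    body={"transient": {"cluster.routing.allocation.enable": "primaries"}}
)

# 2. (Synced-flush as in the earlier sketch, then restart the node,
#    e.g. `systemctl restart elasticsearch` on that host.)

# 3. Re-enable allocation once the node has rejoined (null resets it).
es.cluster.put_settings(
    body={"transient": {"cluster.routing.allocation.enable": None}}
)

# 4. Poll ongoing recoveries until the cluster settles.
while True:
    active = es.cat.recovery(active_only=True)
    if not active.strip():
        break
    print(active)
    time.sleep(30)
```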