Hi, might be a silly question, but did you wait for the restarted node to rejoin the cluster and be available before you set cluster.routing.allocation.enable to "all"?
I have had no major problems doing rolling restarts on my cluster with 20 nodes and quite a few TB of data. Largest shards have been 200GB.
My cluster is "rack aware", so I set cluster.routing.allocation.enable to "none", restart ES on all nodes in one "rack", wait until they all show up again in the list of nodes, and then set cluster.routing.allocation.enable back to "all". It usually takes less than a minute for the cluster to go from yellow back to green.
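For what it's worth, here is a rough sketch of that sequence in Python (standard library only). The address in ES_URL, the helper names and the restart_node callback are all made up for illustration, and using a "transient" setting rather than a "persistent" one is my assumption. It only relies on the cluster settings and nodes info endpoints, so treat it as a starting point rather than a tested tool.

```python
import json
import time
import urllib.request

# Assumed address; point it at a node that is NOT being restarted.
ES_URL = "http://localhost:9200"

ALLOWED_MODES = ("all", "primaries", "new_primaries", "none")


def allocation_body(mode):
    """Build the cluster settings body that sets shard allocation to `mode`."""
    if mode not in ALLOWED_MODES:
        raise ValueError("unknown allocation mode: %r" % (mode,))
    # "transient" is an assumption; it does not survive a full cluster restart.
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def call(method, path, body=None):
    """Send one JSON request to Elasticsearch and return the decoded reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def missing_nodes(expected, nodes_info):
    """Return the expected node names absent from a GET /_nodes reply."""
    present = {node["name"] for node in nodes_info["nodes"].values()}
    return sorted(set(expected) - present)


def restart_rack(rack_nodes, restart_node, poll_seconds=5):
    """Disable allocation, restart one rack, wait for rejoin, re-enable."""
    call("PUT", "/_cluster/settings", allocation_body("none"))
    for name in rack_nodes:
        restart_node(name)  # hypothetical callback, e.g. ssh + service restart
    # Only re-enable allocation once every restarted node is listed again.
    while missing_nodes(rack_nodes, call("GET", "/_nodes")):
        time.sleep(poll_seconds)
    call("PUT", "/_cluster/settings", allocation_body("all"))
```

The point of the loop is the question above: allocation goes back to "all" only after every restarted node shows up in the node list again.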
Not sure if introducing new nodes causes some sort of re-balancing of shards....
Thanks for the advice, A_B.
Restarted node3 10 min ago.
After node3 joined the cluster, I waited an extra 3 min.
Actually, it is the INITIALIZING that is taking time.
The replica shards on node3 are all started.
The shards that were primaries on node3 are now INITIALIZING as replicas on node3 (two at a time). During the process, network usage on node3 is maxed out, with traffic coming from the nodes that now hold the primary shards.
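To keep an eye on this, a small sketch like the one below can count shards per state on one node. It assumes the text comes from GET /_cat/shards?h=index,shard,prirep,state,node; the function name and the sample rows are made up for illustration. The "two at a time" is, I believe, the default of cluster.routing.allocation.node_concurrent_recoveries, but please check that for your version.

```python
from collections import Counter


def shard_states(cat_text, node):
    """Count shards per (prirep, state) on `node`.

    Expects the plain-text output of
    GET /_cat/shards?h=index,shard,prirep,state,node
    """
    counts = Counter()
    for line in cat_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # unassigned shards have an empty node column
        prirep, state, shard_node = fields[2], fields[3], fields[4]
        if shard_node == node:
            counts[(prirep, state)] += 1
    return counts


# Made-up sample rows, not real output from this cluster.
sample = (
    "logs 0 r STARTED      node3\n"
    "logs 1 r INITIALIZING node3\n"
    "logs 1 p STARTED      node1\n"
    "logs 2 r UNASSIGNED\n"
)
print(shard_states(sample, "node3"))
```

Running it every minute or so should show the INITIALIZING count for replicas on node3 dropping as the copies finish.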