Background: the business scenario does not allow downtime for maintenance, and there will be heavy read and write traffic throughout, so only a rolling installation of the plugin can be considered.
The cluster is version 6.8.1, with 3 master nodes, 30 data nodes, and indices configured with 1 replica.
My understanding is that after each node restarts, cluster.routing.allocation.enable must be set back to all so that replicas can catch up with the primary shards; otherwise the rolling restart could leave the cluster without an up-to-date copy of some shards. Is this understanding correct?
This version is ancient - you need to upgrade to a supported version as a matter of some urgency. You can do this upgrade in a rolling fashion, first to 7.17.29 and then again to 8.19.14 which is a currently-supported version. All currently-supported versions have built-in repository functionality, no need to install any plugins.
Thanks for sharing the detailed background. I've dealt with similar constraints on legacy clusters (no downtime, large read/write volume).
To answer your specific question about cluster.routing.allocation.enable=all:
Yes, your understanding is essentially correct. During a rolling restart, you typically want to:

1. Temporarily disable allocation (none) to prevent shard thrashing.
2. Restart the node.
3. Set it back to all so the cluster rebalances gradually.
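The steps above can be sketched with the cluster settings API. This is a minimal sketch, not a drop-in script: ES_HOST is an assumed placeholder for one of your nodes, and the curl calls are left commented so you can review them before running against a production cluster.

```shell
#!/bin/sh
# Sketch of the allocation toggling around each node restart (6.8.x).
# ES_HOST is an assumed placeholder; point it at a reachable node.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Build the transient settings body once so both PUTs stay symmetric.
allocation_body() {
  # $1 is "none" (before stopping a node) or "all" (after it rejoins)
  printf '{"transient":{"cluster.routing.allocation.enable":"%s"}}' "$1"
}

# 1. Before stopping a node: disable allocation so the cluster does not
#    start rebuilding that node's shards elsewhere.
# curl -s -X PUT "$ES_HOST/_cluster/settings" \
#   -H 'Content-Type: application/json' -d "$(allocation_body none)"

# 2. Stop the node, install the plugin, start it, wait for it to rejoin.

# 3. Re-enable allocation so replicas recover and catch up:
# curl -s -X PUT "$ES_HOST/_cluster/settings" \
#   -H 'Content-Type: application/json' -d "$(allocation_body all)"

echo "disable body: $(allocation_body none)"
echo "enable body:  $(allocation_body all)"
```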
However, on 6.8.1, you also need to watch for replica sync delays after each node returns. The primary shard may have new writes during the node's downtime, and the replica can take time to catch up. Setting all too early can cause routing decisions based on stale replica metadata.
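One way to guard against moving on too early is to gate each restart on the cluster returning to green and on active recoveries draining. Again a hedged sketch with ES_HOST as an assumed placeholder; the curl lines are commented for safety and the helper just shows the URL it would hit.

```shell
#!/bin/sh
# Sketch: wait for replicas to catch up before restarting the next node.
# ES_HOST is an assumed placeholder; point it at a reachable node.
ES_HOST="${ES_HOST:-http://localhost:9200}"

wait_for_green() {
  # The health API blocks up to the given timeout until status is green.
  url="$ES_HOST/_cluster/health?wait_for_status=green&timeout=$1"
  echo "GET $url"
  # curl -s "$url"
}

wait_for_green 30m

# Ongoing replica recoveries; output is empty once replicas have caught up:
# curl -s "$ES_HOST/_cat/recovery?active_only=true&v"
```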