I have an issue where, sometimes during a rolling restart, when it gets to a node that holds a primary shard, the replica shards go into a PRIMARY_FAILED state once that node is offline.
For example:
my-index 11 p UNASSIGNED NODE_LEFT
my-index 11 r UNASSIGNED PRIMARY_FAILED
my-index 11 r UNASSIGNED PRIMARY_FAILED
This doesn't happen every time, and I can't find a way to reproduce it consistently.
According to the documentation, PRIMARY_FAILED means: "The shard was initializing as a replica, but the primary shard failed before the initialization completed."
How do I prevent this? I am restarting one node at a time and waiting for all shards to be allocated and the cluster to be in a green state before moving on to the next node. Shard allocation is turned off before each node is taken down and turned back on once it is back online.
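The per-node sequence described above can be sketched roughly as follows. This is only an illustration under assumptions: `ES_URL`, the helper names, and the `restart_node` callback are hypothetical, the cluster is assumed to be reachable without authentication on a node other than the one being restarted, and allocation is set to `primaries` where "turned off" could equally mean `none`.

```python
import json
import time
import urllib.request

# Assumption: an unauthenticated node that is NOT the one being restarted.
ES_URL = "http://localhost:9200"


def allocation_settings(mode):
    """Body for PUT _cluster/settings.

    mode is "primaries" or "none" to restrict allocation,
    or None to reset the setting to its default ("all").
    """
    return {"persistent": {"cluster.routing.allocation.enable": mode}}


def call(method, path, body=None):
    """Send one JSON request to the cluster and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        ES_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read() or b"{}")


def restart_one_node(restart_node, poll_seconds=10):
    """Restart a single node, then wait until the cluster is green again.

    restart_node is a hypothetical callback that stops the node, starts it,
    and returns only once the node has rejoined the cluster.
    """
    # Restrict allocation before the node goes down.
    call("PUT", "/_cluster/settings", allocation_settings("primaries"))
    # Optional flush, so shard recovery has less translog to replay.
    call("POST", "/_flush")
    restart_node()
    # Re-enable allocation by resetting the setting to its default.
    call("PUT", "/_cluster/settings", allocation_settings(None))
    # Do not move on to the next node until the cluster is green.
    while call("GET", "/_cluster/health")["status"] != "green":
        time.sleep(poll_seconds)
```

Calling `restart_one_node` once per node, in sequence, gives the one-node-at-a-time behaviour: the loop at the end blocks until `GET _cluster/health` reports `green`.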
I can't find any documentation that explains how to prevent this, and I am following all the steps in the "Full-cluster restart and rolling restart" page of the Elasticsearch Guide [8.1].
So I am not really sure what is causing this. Any ideas or insights would be great!