My team is working on upgrade plans and deciding whether to move to an Azure Resource Manager deployment so we can have 20 update domains, use shard allocation awareness, or both. As a result of that work, and of one Azure incident that impacted our cluster when it "shouldn't" have, we have some questions.
The definition of "red" for cluster health is "At least one primary shard (and all of its replicas) is missing". Can someone define "missing"? Let's say the only available instance of a shard (primary or replica) is initializing. Will the cluster be red in that case?

I'm asking because, while Azure update domains determine which data nodes can be restarted simultaneously, there is no information about how long Azure will wait between restarting nodes in different update domains. The real-world scenario I'm wondering about is this: say we have VM1 with shard 1 (primary) and shard 2 (replica), and VM2 with shard 1 (replica) and shard 2 (primary). VM1 gets restarted and comes back, and then VM2 gets restarted while shard 1 (primary) is still initializing. Would the cluster state be red, with the data unavailable, until that shard is in a started state?
We understand that shard size influences how long shards take to initialize, but we would still like to know how this case is handled.
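To make the scenario concrete, here is a small Python sketch of how I currently assume health is derived. It is only a toy model, not Elasticsearch code: the names `ShardState`, `shard_health`, and `cluster_health` are made up for illustration, and the rule that an initializing copy does not count as available is exactly the assumption I would like confirmed or corrected.

```python
from enum import Enum


class ShardState(Enum):
    STARTED = "STARTED"
    INITIALIZING = "INITIALIZING"
    UNASSIGNED = "UNASSIGNED"


def shard_health(primary, replicas):
    """Health of one shard given its primary state and replica states.

    ASSUMPTION (the point of my question): only a STARTED primary
    counts as available; an INITIALIZING one is treated as missing.
    """
    if primary is not ShardState.STARTED:
        return "red"
    if any(r is not ShardState.STARTED for r in replicas):
        return "yellow"
    return "green"


def cluster_health(shards):
    """Worst per-shard health across the cluster."""
    order = ["green", "yellow", "red"]
    return max((shard_health(p, r) for p, r in shards.values()), key=order.index)


S, I, U = ShardState.STARTED, ShardState.INITIALIZING, ShardState.UNASSIGNED

# Hypothetical timeline for the two-VM scenario: (primary, [replicas]) per shard.
timeline = {
    "both VMs up": {"shard1": (S, [S]), "shard2": (S, [S])},
    # VM1 down: I assume the shard 1 copy on VM2 takes over as primary.
    "VM1 restarting": {"shard1": (S, [U]), "shard2": (S, [U])},
    "VM1 back, copies initializing": {"shard1": (S, [I]), "shard2": (S, [I])},
    # VM2 down while the only remaining copies (on VM1) are still initializing.
    "VM2 restarting too early": {"shard1": (I, [U]), "shard2": (I, [U])},
}

for step, shards in timeline.items():
    print(f"{step}: {cluster_health(shards)}")
```

Under that assumption the last step comes out red. If an initializing copy is in fact treated differently, the first branch of `shard_health` is the part of my mental model that is wrong.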
Thanks in advance,