We have a patching run that hits our servers. I don't want relocation to happen during the patching, so I set the delay to a window longer than the patching takes. This generally works: the node goes down, and after patching, recovery is very fast. But I am finding that sometimes relocation still happens even though the node was down for less time than the configured timeout. I happened to catch it tonight, and I am a bit confused.
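For reference, I bump the delay before each patching run with the index settings API, roughly like this (the _all pattern and the value shown are just illustrative):

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "90m"
  }
}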
The patches take about 50 minutes, so I temporarily set the node_left delay to 90 minutes. I have a scenario where the node is back up and there are some shards on it, but I still see this:
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "NODE_LEFT",
"at" : "2021-11-13T03:32:36.422Z",
"details" : "node_left [O7QKHiQQSgqWk3DCiMylYw]",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "allocation_delayed",
"allocate_explanation" : "cannot allocate because the cluster is still waiting 5.5m for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",
That node is:
"O7QKHiQQSgqWk3DCiMylYw" : {
"timestamp" : 1636779559240,
"name" : "_data-5_1",
"transport_address" : "10.0.2.20:9300",
"host" : "10.0.2.20",
"ip" : "10.0.2.20:9300",
"roles" : [
All nodes are up, including that one. It is even hosting other shards:
foo_v20_all 0 r STARTED 930647 925.8mb 10.0.2.20 _data-5_1
bar_v20_all 4 r STARTED 585452 2.5gb 10.0.2.20 _data-5_1
.kibana_task_manager 0 r STARTED 2 6.9kb 10.0.2.20 _data-5_1
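(That listing is from the cat shards API, filtered here to the 10.0.2.20 node:

GET _cat/shards
)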
The node had been back up for about 40 minutes, yet it still showed as waiting.
Actually... as I was writing this, the shards started initializing (before the timeout expired, and they did not relocate):
foo_v20_all 0 r STARTED 930647 925.8mb 10.0.2.20 _data-5_1
bar_v20_all 4 r STARTED 585452 2.5gb 10.0.2.20 _data-5_1
.kibana_task_manager 0 r STARTED 2 6.9kb 10.0.2.20 _data-5_1
foo_v20_nons 27 r INITIALIZING 10.0.2.20 _data-5_1
foo_v20_nons 6 r INITIALIZING 10.0.2.20 _data-5_1
There were a few other shards initializing. Is it possible that, since only so many shards can initialize at once, recoveries are throttled even though the node is back, and allocation/explain therefore still reports the shard as waiting for the node? If I follow that hypothesis, does it mean that if the throttled initializations take longer than the delayed timeout, relocation will kick in anyway?
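If throttling is the cause, I assume the relevant knobs are the recovery limits, e.g. cluster.routing.allocation.node_concurrent_recoveries and indices.recovery.max_bytes_per_sec. Something like this should show the effective values (flat_settings just makes the names easy to scan), and presumably I could raise the concurrency temporarily during patching, though the value here is only an example:

GET _cluster/settings?include_defaults=true&flat_settings=true

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}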
Hoping someone can add some clarity here, since when relocation does kick in it slows things down.
Thanks!