Shard waiting for node even though the node is back

We have a patching run that hits our servers. I don't want relocation to happen during the patching, so I set the delay to a window longer than the patching takes. This generally works: the node goes down, and after patching the recovery is very fast. However, I am finding that sometimes relocation still happens even though the node was down for a shorter duration than the timeout. I happened to catch it tonight and I am a bit confused.

The patches take about ~50 min. I temporarily set the node_left delay to 90 min. I have a scenario where the node is back up, and there are some shards on it, but I still see this:
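For reference, I'm setting the delay via the index-level `index.unassigned.node_left.delayed_timeout` setting, roughly like this (applied to all indices; the `90m` matches my patching window):

```
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "90m"
  }
}
```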

  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2021-11-13T03:32:36.422Z",
    "details" : "node_left [O7QKHiQQSgqWk3DCiMylYw]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "allocation_delayed",
  "allocate_explanation" : "cannot allocate because the cluster is still waiting 5.5m for the departed node holding a replica to rejoin, despite being allowed to allocate the shard to at least one other node",

That node is:

    "O7QKHiQQSgqWk3DCiMylYw" : {
      "timestamp" : 1636779559240,
      "name" : "_data-5_1",
      "transport_address" : "",
      "host" : "",
      "ip" : "",
      "roles" : [

All nodes are up, including that one. It is even hosting other shards:

foo_v20_all            0     r      STARTED        930647  925.8mb _data-5_1
bar_v20_all               4     r      STARTED        585452    2.5gb _data-5_1
.kibana_task_manager      0     r      STARTED             2    6.9kb _data-5_1

The node had been back up for about 40 min, yet it still shows as waiting for the node.

Actually... as I was writing this, they started initializing (before the timeout expired, and they did not relocate):

foo_v20_all            0     r      STARTED        930647  925.8mb _data-5_1
bar_v20_all               4     r      STARTED        585452    2.5gb _data-5_1
.kibana_task_manager      0     r      STARTED             2    6.9kb _data-5_1
foo_v20_nons        27    r      INITIALIZING          _data-5_1
foo_v20_nons        6     r      INITIALIZING          _data-5_1

There were a few other shards initializing. Is it possible that, since only so many shards can initialize at once, recovery is throttled even though the node is back, and so allocation/explain still reports that it is waiting for the node? If I follow that hypothesis, then if the other initializations take longer than the timeout delay, relocation will kick in?
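If that hypothesis is right, the relevant throttle would presumably be the concurrent-recovery limits. A quick way to check what they are set to on my cluster (this is just a sketch of the query; `filter_path` trims the output to the relevant keys):

```
GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.node_concurrent*
```

This should show `cluster.routing.allocation.node_concurrent_recoveries` along with the `node_concurrent_incoming_recoveries` and `node_concurrent_outgoing_recoveries` variants.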

Hoping someone can add some clarity here, as relocation slows things down when it kicks in.



It's hard to say from the info provided but the first thing I'd check is whether the master was just busy doing other things for all that time. The task that drops the delayed-allocation block runs at the 4th-highest priority level, so there's a bunch of other things that would preempt it. The pending tasks API would answer this.
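For example, something along these lines while the shards are stuck waiting:

```
GET _cluster/pending_tasks
```

Each queued cluster-state task in the response includes its `source`, `priority`, and `time_in_queue`, so you can see whether higher-priority tasks were keeping the master occupied.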

Ah. So if there were tasks at a higher priority level that were consuming the master, it might not have gotten to the task of removing the allocation block. I assume then that when the task is picked up, if it is before the delayed timeout has expired, the shard initializes as normal, but if the allocation delay has timed out by that time, it goes into relocation. Anyway, thanks for the tip on pending_tasks. I will check that the next time I see this scenario.

Yes that's about right. It's not really "relocation", the shard will often be allocated back to the original node anyway, but that's not guaranteed after the allocation delay has elapsed.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.