Possible bug: Elasticsearch not honouring index.unassigned.node_left.delayed_timeout

Elasticsearch latest (8.17.3), running in Kubernetes via ECK (2.16.0), waits for 5 minutes before assigning an unassigned shard, regardless of the value of index.unassigned.node_left.delayed_timeout in the index configuration.

Please help me to verify this issue before opening a case.

To reproduce the issue we will create a test 6-node cluster, configure an index template with custom settings for the number of shards (3) and delayed_timeout (5 seconds), index an event, and terminate a pod.
Once the pod leaves, the expected result is that Elasticsearch assigns the unassigned shard after ~5 seconds, but instead it waits for 5 minutes, contradicting the documentation.

Via the Explain command (full output at the end of the reproduction steps) we can see that the new setting of 5 seconds (5,000 milliseconds) is in place, but Elasticsearch intends to wait for 5 minutes (300,000 milliseconds):

  "can_allocate" : "allocation_delayed",
  "allocate_explanation" : "The node containing this shard copy recently left the cluster. Elasticsearch is waiting for it to return. If the node does not return within [4.9m] then Elasticsearch will allocate this shard to another node. Please wait.",
  "configured_delay_in_millis" : 5000,
  "remaining_delay_in_millis" : 295241,

In this example the pod comes back quickly and the recovery does not reach the 5-minute mark, but this behaviour matches what we have seen in more complex situations in production: Elasticsearch will wait.

To reproduce:

  • Create test cluster
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.17.3
  nodeSets:
  - name: default
    count: 6
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 4Gi
              cpu: 1
            limits:
              memory: 4Gi
        volumes:
        - name: elasticsearch-data
          emptyDir: {}
EOF
  • Set up the environment and validate
PASSWORD=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}') && kubectl port-forward service/quickstart-es-default 9200 &

% curl -u "elastic:$PASSWORD" -k "https://localhost:9200/_cat/health?v"
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign unassign.pri pending_tasks max_task_wait_time active_shards_percent
1742917976 15:52:56  quickstart green           6         6      4   2    0    0        0            0             0                  -                100.0%
  • Create index template with our custom settings
curl -u "elastic:$PASSWORD" -k -XPUT "https://localhost:9200/_component_template/test_default" -H 'Content-Type: application/json' -d'
{
  "template" : {
    "settings" : {
      "index.number_of_shards" : "3",
      "index.unassigned.node_left.delayed_timeout": "5s"
    }
  }
}
';echo
curl -u "elastic:$PASSWORD" -k -XPUT "https://localhost:9200/_index_template/my-index" -H 'Content-Type: application/json' -d'
{
    "index_patterns": ["my-index-*"],
    "priority": 500,
    "composed_of": ["test_default"]
}
';echo
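Optionally, the simulate index API can be used to confirm which template and settings would apply to a matching index name before creating it (not required for the reproduction):
curl -u "elastic:$PASSWORD" -k -XPOST "https://localhost:9200/_index_template/_simulate_index/my-index-000001?pretty"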
  • Create an event
curl -u "elastic:$PASSWORD" -k -XPOST "https://localhost:9200/my-index-000001/_doc/" -H 'Content-Type: application/json' -d'
{
  "@timestamp": "2099-11-15T13:12:00",
  "message": "GET /search HTTP/1.1 200 1070000",
  "user": {
    "id": "kimchy"
  }
}
';echo
  • Validate index template and presence of settings
% curl -u "elastic:$PASSWORD" -sk "https://localhost:9200/my-index-000001?pretty"
{
  "my-index-000001" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "message" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "user" : {
          "properties" : {
            "id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "3",
        "provided_name" : "my-index-000001",
        "creation_date" : "1742918205375",
        "unassigned" : {
          "node_left" : {
            "delayed_timeout" : "5s"
          }
        },
        "number_of_replicas" : "1",
        "uuid" : "bK9d6lGXRBeiS9qOj_UMbg",
        "version" : {
          "created" : "8521000"
        }
      }
    }
  }
}

Please note delayed_timeout = 5 seconds
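
If you only need to confirm the delay setting rather than the full index metadata, filter_path can narrow the output, e.g.:
curl -u "elastic:$PASSWORD" -sk "https://localhost:9200/my-index-000001/_settings?filter_path=*.settings.index.unassigned&pretty"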

  • Terminate a pod (the third in this example)
kubectl delete pod quickstart-es-default-2 
  • Explain command output
% curl -s -u "elastic:$PASSWORD" -k "https://localhost:9200/_cluster/allocation/explain?pretty"
{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API. See https://www.elastic.co/guide/en/elasticsearch/reference/8.17/cluster-allocation-explain.html for more information.",
  "index" : "my-index-000001",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_RESTARTING",
    "at" : "2025-03-25T16:05:39.835Z",
    "details" : "node_left [_dMiTTrdTg-NaqSFut2K2Q]",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "allocation_delayed",
  "allocate_explanation" : "The node containing this shard copy recently left the cluster. Elasticsearch is waiting for it to return. If the node does not return within [4.9m] then Elasticsearch will allocate this shard to another node. Please wait.",
  "configured_delay_in_millis" : 5000,
  "remaining_delay_in_millis" : 295241,
  "node_allocation_decisions" : [
    {
      "node_id" : "F5apKdxnRDSzgKKB6MkFNw",
      "node_name" : "quickstart-es-default-4",
      "transport_address" : "10.77.99.79:9300",
      "node_attributes" : {
        "ml.config_version" : "12.0.0",
        "ml.machine_memory" : "4294967296",
        "ml.allocated_processors_double" : "80.0",
        "k8s_node_name" : "xxx11542",
        "transform.config_version" : "10.0.0",
        "xpack.installed" : "true",
        "ml.allocated_processors" : "80",
        "ml.max_jvm_size" : "2147483648"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "yes"
    },
    {
      "node_id" : "iAyiHv3AS7WMd-KRVusFKg",
      "node_name" : "quickstart-es-default-5",
      "transport_address" : "10.77.104.205:9300",
      "node_attributes" : {
        "ml.config_version" : "12.0.0",
        "ml.machine_memory" : "4294967296",
        "ml.allocated_processors_double" : "80.0",
        "k8s_node_name" : "xxx11981",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "ml.allocated_processors" : "80",
        "ml.max_jvm_size" : "2147483648"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "yes"
    },
    {
      "node_id" : "oyI-TM8ZQD6i1jP33gjrWA",
      "node_name" : "quickstart-es-default-3",
      "transport_address" : "10.77.68.85:9300",
      "node_attributes" : {
        "ml.config_version" : "12.0.0",
        "ml.machine_memory" : "4294967296",
        "ml.allocated_processors_double" : "128.0",
        "k8s_node_name" : "xxx15210",
        "transform.config_version" : "10.0.0",
        "xpack.installed" : "true",
        "ml.allocated_processors" : "128",
        "ml.max_jvm_size" : "2147483648"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "yes"
    },
    {
      "node_id" : "qFaFBK3YTVCH7MnSjQk9PQ",
      "node_name" : "quickstart-es-default-1",
      "transport_address" : "10.77.92.75:9300",
      "node_attributes" : {
        "ml.config_version" : "12.0.0",
        "ml.max_jvm_size" : "2147483648",
        "ml.allocated_processors" : "80",
        "k8s_node_name" : "xxx12200",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "ml.allocated_processors_double" : "80.0",
        "ml.machine_memory" : "4294967296"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "yes"
    },
    {
      "node_id" : "h1aC0fU8SfGZ2lq0LJ2DRg",
      "node_name" : "quickstart-es-default-0",
      "transport_address" : "10.77.29.80:9300",
      "node_attributes" : {
        "ml.config_version" : "12.0.0",
        "ml.allocated_processors_double" : "128.0",
        "ml.machine_memory" : "4294967296",
        "k8s_node_name" : "xxx16412",
        "xpack.installed" : "true",
        "transform.config_version" : "10.0.0",
        "ml.max_jvm_size" : "2147483648",
        "ml.allocated_processors" : "128"
      },
      "roles" : [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision" : "no",
      "store" : {
        "matching_size_in_bytes" : 6061
      },
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[my-index-000001][1], node[h1aC0fU8SfGZ2lq0LJ2DRg], [P], s[STARTED], a[id=cEcC5IhdSbujh2EFh5PpNQ], failed_attempts[0]]"
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are [2] copies of this shard and [5] values for attribute [k8s_node_name] ([xxx11542, xxx11981, xxx12200, xxx15210, xxx16412] from nodes in the cluster and no forced awareness) so there may be at most [1] copies of this shard allocated to nodes with each value, but (including this copy) there would be [2] copies allocated to nodes with [node.attr.k8s_node_name: xxx16412]"
        }
      ]
    }
  ]
}

The behaviour you describe is deliberate: ECK may set the allocation_delay parameter when marking the node for shutdown using the put-shutdown API, and the docs say:

If you specify both a restart allocation delay and an index-level allocation delay, the longer of the two is used.

Note that this only applies when ECK knows that a node is restarting, and in this case it's much more efficient to wait for the node to return to the cluster instead of rushing to allocate shards to the remaining nodes, which can cause a rebalancing storm.
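
For illustration only (this is not necessarily the exact request ECK sends), a restart-type shutdown registration with an explicit allocation_delay looks roughly like this, with a placeholder node ID:
curl -u "elastic:$PASSWORD" -k -XPUT "https://localhost:9200/_nodes/<node-id>/shutdown" -H 'Content-Type: application/json' -d'
{
  "type": "restart",
  "reason": "planned pod restart (example)",
  "allocation_delay": "10m"
}
';echo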

Actually, I see a small docs bug here: the docs should also mention that the default allocation_delay is 5m, which is what applies if ECK does not set the allocation_delay parameter.

Thank you David for your quick answer.
But I'd like to raise a few additional issues.

  • Users looking for information about how Elasticsearch manages shard allocation when a node fails are directed to the Delaying allocation when a node leaves documentation page, and the existence of this cluster-level default is not mentioned there.
  • The documentation indicates that the way to change delayed_timeout is via an _all/_settings index settings update (see the example after this list), but the fact that this setting is ignored when it is lower than the cluster default is not mentioned on that page either.
  • The documentation indicates that the default value of delayed_timeout is 1 minute, which is misleading.
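
For reference, the index-level change that documentation page describes is a plain settings update, e.g.:
curl -u "elastic:$PASSWORD" -k -XPUT "https://localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5s"
  }
}
';echo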

Besides those items, which could be addressed by expanding the documentation, this raises the question: how can I set index.unassigned.node_left.delayed_timeout below 5 minutes?

Thanks.

Yep, fair point; I added that to the docs bug report.

This setting is only ignored during a graceful node restart (e.g. kubectl delete pod). If the node is not expected to come back after a restart, or it falls out of the cluster unexpectedly, then index.unassigned.node_left.delayed_timeout will apply.

The value in use is under ECK's control, since it's ECK that calls the put-shutdown API. I don't know whether it's configurable there; you would need to ask some ECK experts about that.
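
If you want to see what ECK has actually registered (and therefore which delay is in effect), you can inspect the shutdown metadata directly, e.g.:
curl -u "elastic:$PASSWORD" -sk "https://localhost:9200/_nodes/shutdown?pretty"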

Can you share a little more context? Why do you want to force faster replica allocations (and a potential rebalancing storm) when doing a graceful node restart?

Thank you David for your interest.

I'm chatting with my colleagues to articulate a proper response (I'm working on an Elasticsearch migration to Kubernetes, but I'm not a Kubernetes expert).