Shards are not relocated from excluded data nodes

When excluding certain nodes with the `cluster.routing.allocation.exclude._ip` setting, most of the shards are moved off of those nodes. However, a few nodes still hold some shards. The first two nodes below have one shard each even though they are excluded from allocation:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
     1         13gb    78.1gb      3.3tb      3.4tb            2 10.0.54.39  10.0.54.39  es-data-i-06a6ccfe35e55a373
     1       13.1gb    74.1gb      3.3tb      3.4tb            2 10.0.40.130 10.0.40.130 es-data-i-05964a0d46869f1a0
   123        1.4tb     1.5tb      1.9tb      3.4tb           44 10.0.53.110 10.0.53.110 es-data-i-03cf7c9c7ef35d91b
   123        1.3tb     1.3tb        2tb      3.4tb           39 10.0.41.37  10.0.41.37  es-data-i-0cf78468318fbd107
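
For reference, this is roughly how the exclusion was applied and how the overview above was obtained; only two of the excluded IPs are shown here, the full list is visible in the explain output below:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._ip": "10.0.54.39,10.0.40.130"
  }
}

GET _cat/allocation?v&s=shards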

Investigating those shards with the cluster allocation explain API confirms that they indeed should not remain on those nodes:

  "can_remain_on_current_node": "no",
  "can_remain_decisions": [
    {
      "decider": "filter",
      "decision": "NO",
      "explanation": """node matches cluster setting [cluster.routing.allocation.exclude] filters [_ip:"10.0.35.3 OR 10.0.43.177 OR 10.0.40.130 OR 10.0.45.193 OR 10.0.43.124 OR 10.0.42.231 OR 10.0.42.179 OR 10.0.46.56 OR 10.0.52.223 OR 10.0.51.26 OR 10.0.50.74 OR 10.0.55.224 OR 10.0.52.197 OR 10.0.54.39 OR 10.0.53.177 OR 10.0.44.189 OR 10.0.32.136 OR 10.0.38.232 OR 10.0.32.108 OR 10.0.37.223 OR 10.0.34.143 OR 10.0.34.197 OR 10.0.33.133 OR 10.0.36.22 OR 10.0.48.18"]"""
    }
  ]
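
For completeness, the explain request was roughly the following; the index name and shard number are taken from the stuck shard that appears in the logs further down, and whether the copy is a primary is assumed here for illustration:

GET _cluster/allocation/explain
{
  "index": "usersearch_v23_54_production_users",
  "shard": 18,
  "primary": true
}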

However, those shards can't be moved either, because the maximum number of relocation retries has been exceeded:

      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [10] on failed relocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [failed_attempts[10]]"
        }
      ]
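
Per the suggestion in that explanation, the retry was forced with:

POST /_cluster/reroute?retry_failed=true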

Upon forcing relocation this way, Elasticsearch does attempt to move the shards, but fails again after 10 attempts, spewing this into the logs:

Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [usersearch_v23_54_production_users][18]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [6068348ms]
	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:987) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:887) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:429) ~[elasticsearch-8.6.1.jar:?]
	... 17 more

java.io.IOException: failed to obtain in-memory shard lock
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:527) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:851) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:175) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:569) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShard(IndicesClusterStateService.java:508) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndicesAndUpdateShards(IndicesClusterStateService.java:463) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:226) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:538) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:524) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:497) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:257) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223) ~[elasticsearch-8.6.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]

Hence, those shards are effectively stuck on those nodes: ES can't obtain the shard lock because it is still held by a [closing shard] operation.
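
To double-check which copies are still sitting on the excluded nodes, something like this can be used (index name taken from the log above; the column selection is just an illustrative subset):

GET _cat/shards/usersearch_v23_54_production_users?v&h=index,shard,prirep,state,node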

Please advise on what to do in a case like this. We can reproduce it pretty consistently when excluding ~20 nodes from allocation on multiple clusters.

We are running this ES version on bare AWS EC2 nodes:

"version": {
    "number": "8.6.1",
    "build_flavor": "default",
    "build_type": "rpm",
    "build_hash": "180c9830da956993e59e2cd70eb32b5e383ea42c",
    "build_date": "2023-01-24T21:35:11.506992272Z",
    "build_snapshot": false,
    "lucene_version": "9.4.2",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  }

I suggest you upgrade to at least 8.8 to pick up "Async creation of IndexShard instances" (elastic/elasticsearch PR #94545); when 8.15 is released, upgrade to that to pick up "Async close of `IndexShard`" (elastic/elasticsearch PR #108145) too.


Thank you so much @DavidTurner! I'm always amazed by the prompt turnaround here!

It will take a bit of time for us to upgrade, as we typically roll out new versions gradually, but we were planning to do it anyway. I'll come back to this thread if the upgrade doesn't help.

Thank you!