Shards are not relocated from excluded data nodes

When excluding certain nodes with the `cluster.routing.allocation.exclude._ip` setting, most of the shards are moved off of those nodes. However, a few nodes still hold some shards. The first two nodes below have one shard each even though they are excluded from allocation:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
     1         13gb    78.1gb      3.3tb      3.4tb            2 10.0.54.39  10.0.54.39  es-data-i-06a6ccfe35e55a373
     1       13.1gb    74.1gb      3.3tb      3.4tb            2 10.0.40.130 10.0.40.130 es-data-i-05964a0d46869f1a0
   123        1.4tb     1.5tb      1.9tb      3.4tb           44 10.0.53.110 10.0.53.110 es-data-i-03cf7c9c7ef35d91b
   123        1.3tb     1.3tb        2tb      3.4tb           39 10.0.41.37  10.0.41.37  es-data-i-0cf78468318fbd107
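
For reference, this is roughly how the exclusion was applied and how the overview above was obtained; only two of the excluded IPs are shown here, the full list is visible in the explain output below:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.exclude._ip": "10.0.54.39,10.0.40.130"
  }
}

GET _cat/allocation?v&s=shards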

Investigating those shards with the cluster allocation explain API confirms that they indeed should not remain on those nodes:

  "can_remain_on_current_node": "no",
  "can_remain_decisions": [
    {
      "decider": "filter",
      "decision": "NO",
      "explanation": """node matches cluster setting [cluster.routing.allocation.exclude] filters [_ip:"10.0.35.3 OR 10.0.43.177 OR 10.0.40.130 OR 10.0.45.193 OR 10.0.43.124 OR 10.0.42.231 OR 10.0.42.179 OR 10.0.46.56 OR 10.0.52.223 OR 10.0.51.26 OR 10.0.50.74 OR 10.0.55.224 OR 10.0.52.197 OR 10.0.54.39 OR 10.0.53.177 OR 10.0.44.189 OR 10.0.32.136 OR 10.0.38.232 OR 10.0.32.108 OR 10.0.37.223 OR 10.0.34.143 OR 10.0.34.197 OR 10.0.33.133 OR 10.0.36.22 OR 10.0.48.18"]"""
    }
  ]
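
For completeness, the explain request was roughly the following; the index name and shard number are taken from the stuck shard that appears in the logs further down, and whether the copy is a primary is assumed here for illustration:

GET _cluster/allocation/explain
{
  "index": "usersearch_v23_54_production_users",
  "shard": 18,
  "primary": true
}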

However, those shards can't be moved either, because the maximum number of relocation retries has been exceeded:

      "node_decision": "no",
      "weight_ranking": 3,
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [10] on failed relocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [failed_attempts[10]]"
        }
      ]
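
Per the suggestion in that explanation, the retry was forced with:

POST /_cluster/reroute?retry_failed=true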

Upon forcing relocation this way, Elasticsearch does attempt to move the shards, but fails again after 10 attempts, spewing this into the logs:

Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [usersearch_v23_54_production_users][18]: obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [6068348ms]
	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:987) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:887) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:429) ~[elasticsearch-8.6.1.jar:?]
	... 17 more

java.io.IOException: failed to obtain in-memory shard lock
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:527) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:851) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:175) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:569) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShard(IndicesClusterStateService.java:508) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndicesAndUpdateShards(IndicesClusterStateService.java:463) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:226) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:538) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:524) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:497) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:850) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:257) ~[elasticsearch-8.6.1.jar:?]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:223) ~[elasticsearch-8.6.1.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]

Hence, those shards are effectively stuck on those nodes: ES can't obtain the shard lock because it is still held by a [closing shard] operation.
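
To double-check which copies are still sitting on the excluded nodes, something like this can be used (index name taken from the log above; the column selection is just an illustrative subset):

GET _cat/shards/usersearch_v23_54_production_users?v&h=index,shard,prirep,state,node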

Please advise on what to do in a case like this. We can reproduce it pretty consistently when excluding ~20 nodes from allocation on multiple clusters.

We are running this ES version on bare AWS EC2 nodes:

"version": {
    "number": "8.6.1",
    "build_flavor": "default",
    "build_type": "rpm",
    "build_hash": "180c9830da956993e59e2cd70eb32b5e383ea42c",
    "build_date": "2023-01-24T21:35:11.506992272Z",
    "build_snapshot": false,
    "lucene_version": "9.4.2",
    "minimum_wire_compatibility_version": "7.17.0",
    "minimum_index_compatibility_version": "7.0.0"
  }

I suggest you upgrade to at least 8.8 to pick up "Async creation of IndexShard instances" (elastic/elasticsearch PR #94545); when 8.15 is released, upgrade to that to pick up "Async close of `IndexShard`" (elastic/elasticsearch PR #108145) too.


Thank you so much @DavidTurner! I'm always amazed by the prompt turnaround here!

It will take a bit of time for us to upgrade, as we typically roll out new versions gradually, but we were planning to do it anyway. I'll come back to this thread if the upgrade doesn't help.

Thank you!