Desired Balance Allocator stuck - preventing assignment of new shards

Hello,

On one of our larger stacks, we have recently seen (twice last week) a problem seemingly related to the new 'desired balance allocator'. The documentation seems to imply that this is purely a background computation that shouldn't block any other cluster tasks, but we can't think of any other culprit in the situations we've encountered.

We're running version 8.6.2 with a backport of #93461.

Another thing that may be relevant: we've set cluster.routing.allocation.balance.shard to 0 to prioritise index and disk balancing.

The series of events that we saw were:

  1. The master node's CPU gets pinned, with one core fully busy, and stays that way until resolution
  2. During that time, ILM decides to do a rollover
  3. A new index is created and the alias is rolled over, but the new shards remain unassigned
  4. Ingest is now broken

The hot threads output on the master node shows:

   Hot threads at 2023-03-25T21:42:34.520Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   100.0% [cpu=100.0%, other=0.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-0-es-master-0][generic][T#1]'
     2/10 snapshots sharing following 16 elements
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:116)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(BalancedShardsAllocator.java:876)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(BalancedShardsAllocator.java:834)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:183)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputer.compute(DesiredBalanceComputer.java:253)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.lambda$processInput$0(DesiredBalanceShardsAllocator.java:111)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1$$Lambda$7874/0x0000000802233660.run(Unknown Source)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:304)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:108)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:100)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.ContinuousComputation$Processor.doRun(ContinuousComputation.java:92)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
     2/10 snapshots sharing following 17 elements
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$NodeSorter.reset(BalancedShardsAllocator.java:1419)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.buildWeightOrderedIndices(BalancedShardsAllocator.java:782)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balanceByWeights(BalancedShardsAllocator.java:647)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balance(BalancedShardsAllocator.java:506)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:184)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputer.compute(DesiredBalanceComputer.java:253)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.lambda$processInput$0(DesiredBalanceShardsAllocator.java:111)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1$$Lambda$7874/0x0000000802233660.run(Unknown Source)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:304)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:108)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:100)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.ContinuousComputation$Processor.doRun(ContinuousComputation.java:92)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
     4/10 snapshots sharing following 14 elements
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.balance(BalancedShardsAllocator.java:506)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:184)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputer.compute(DesiredBalanceComputer.java:253)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.lambda$processInput$0(DesiredBalanceShardsAllocator.java:111)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1$$Lambda$7874/0x0000000802233660.run(Unknown Source)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:304)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:108)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:100)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.ContinuousComputation$Processor.doRun(ContinuousComputation.java:92)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)
     2/10 snapshots sharing following 13 elements
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:181)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputer.compute(DesiredBalanceComputer.java:253)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.lambda$processInput$0(DesiredBalanceShardsAllocator.java:111)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1$$Lambda$7874/0x0000000802233660.run(Unknown Source)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:304)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:108)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator$1.processInput(DesiredBalanceShardsAllocator.java:100)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.ContinuousComputation$Processor.doRun(ContinuousComputation.java:92)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:917)
       app/org.elasticsearch.server@8.6.2-SNAPSHOT/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
       java.base@19.0.2/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
       java.base@19.0.2/java.lang.Thread.run(Thread.java:1589)

The unassigned shards are marked as allowed to be assigned to the relevant nodes, but with 'no attempt' made at assigning them. The master node shows these two tasks seemingly stuck:

indices:admin/rollover        20:41:35 1h
indices:admin/settings/update 20:41:35 1h

To resolve the situation, the first time we simply killed the master and let another master-eligible node take over. The second time we updated the balance setting mentioned above from 0 to a small non-zero value, which presumably reset or interrupted the desired-balance calculation and allowed the master to resume normal operation.
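For illustration only, a settings bump of that shape looks roughly like this with the low-level Java REST client (the host, the client setup and the use of a persistent setting are placeholders, not how we actually applied it; any equivalent call to the cluster settings API does the same thing):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class BumpShardBalanceSetting {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Raise cluster.routing.allocation.balance.shard from 0 to a small non-zero value;
            // in our case this appeared to interrupt the stuck desired-balance computation.
            Request request = new Request("PUT", "/_cluster/settings");
            request.setJsonEntity("{\"persistent\":{\"cluster.routing.allocation.balance.shard\":\"0.0001\"}}");
            client.performRequest(request);
        }
    }
}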

Is this a known bug? Could it be related to the balance settings we're using, or some other thing?


Not a known bug, no, and yes, it could be related to the balance settings, and also to the fact that you're running a -SNAPSHOT build. Could you share the logs from the master node?

Here's all that got emitted around the time of the latest incident, so not much (debug off, sorry):

{"@timestamp":"2023-03-25T20:41:34.991Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T20:41:35.224Z", "log.level": "INFO", "message":"[v60.tcpevent@1m-002517] creating index, cause [rollover_index], templates [v60.tcpevent@1m], shards [16]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T20:41:35.391Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002517] from [null] to [{\"phase\":\"new\",\"action\":\"complete\",\"name\":\"complete\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T20:41:35.496Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002517] from [{\"phase\":\"new\",\"action\":\"complete\",\"name\":\"complete\"}] to [{\"phase\":\"hot\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:27.635Z", "log.level": "INFO", "message":"updating [cluster.routing.allocation.balance.shard] from [0.0] to [0.0001]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.common.settings.ClusterSettings","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:28.776Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"wait-for-active-shards\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:28.777Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002517] from [{\"phase\":\"hot\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] to [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"branch-check-unfollow-prerequisites\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:28.956Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002517] from [{\"phase\":\"hot\",\"action\":\"unfollow\",\"name\":\"branch-check-unfollow-prerequisites\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:28.957Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"wait-for-active-shards\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"update-rollover-lifecycle-date\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:28.957Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"update-rollover-lifecycle-date\"}] to [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:29.573Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"set-indexing-complete\"}] to [{\"phase\":\"hot\",\"action\":\"complete\",\"name\":\"complete\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:29.649Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"hot\",\"action\":\"complete\",\"name\":\"complete\"}] to [{\"phase\":\"warm\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:30.095Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"warm\",\"action\":\"set_priority\",\"name\":\"set_priority\"}] to [{\"phase\":\"warm\",\"action\":\"unfollow\",\"name\":\"branch-check-unfollow-prerequisites\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:30.172Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"warm\",\"action\":\"unfollow\",\"name\":\"branch-check-unfollow-prerequisites\"}] to [{\"phase\":\"warm\",\"action\":\"migrate\",\"name\":\"branch-check-skip-action\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:30.173Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"warm\",\"action\":\"migrate\",\"name\":\"branch-check-skip-action\"}] to [{\"phase\":\"warm\",\"action\":\"migrate\",\"name\":\"migrate\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:31.820Z", "log.level": "INFO", "message":"moving index [v60.tcpevent@1m-002516] from [{\"phase\":\"warm\",\"action\":\"migrate\",\"name\":\"migrate\"}] to [{\"phase\":\"warm\",\"action\":\"migrate\",\"name\":\"check-migration\"}] in policy [1m]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ilm.IndexLifecycleTransition","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}
{"@timestamp":"2023-03-25T21:57:37.605Z", "log.level": "INFO",  "current.health":"GREEN","message":"Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[v60.tcpevent@1m-002517][7]]]).","previous.health":"YELLOW","reason":"shards started [[v60.tcpevent@1m-002517][7]]" , "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-0-es-master-0][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.routing.allocation.AllocationService","elasticsearch.cluster.uuid":"Wk2FFMJbQGKUAPPeZjek2w","elasticsearch.node.id":"rDcsVg_eTbC4B1Yx5-3z0A","elasticsearch.node.name":"elasticsearch-0-es-master-0","elasticsearch.cluster.name":"elasticsearch-0"}

Would you expect a bug in the disk-usage balancing specifically? As noted above, the only change from 8.6.2 we have in that custom build is #93461.

Let us know if you need the details of the settings we use, or anything else.
We'll update if it happens again.

Hm, OK, if the computation were stuck in a loop then I would have expected an INFO-level log message every minute, because of this PR:

The fact that there are no such log messages is very strange.

Admittedly there are not very extensive tests for non-default values of settings like cluster.routing.allocation.balance.shard, so you're quite some way off the trail at this point. Does setting cluster.routing.allocation.balance.shard: 0.0001 make the problem go away?

Oh, the "is still not converged" message? Nope, no trace of it in our logs.

As we only bumped cluster.routing.allocation.balance.shard a couple of days ago, it's still too soon to say anything conclusive. The other balance settings are also non-default, but they shouldn't be too crazy; the 0 value seemed the most likely to cause trouble, if any. Still, here's the config in case it's relevant:

"balance" : {
  "disk_usage" : "2e-10f",
  "index" : "1.0",
  "threshold" : "1.01",
  "shard" : "0.0001"
}

OK, I thought about this some more and looked at the code more closely, and I think I have a hypothesis as to what is going on. It would indeed be related to disk balancing, and it would be made more likely by reducing the weight of shard balancing.

Unless I misunderstand the code, it looks like balanceByWeights() could get stuck bouncing a shard between two nodes if that shard's size differs from the average shard size (as used in minWeightDelta()) by enough to overcome the threshold margin.
The two nodes also need to have sufficiently close shard counts and disk usage, which is why setting cluster.routing.allocation.balance.shard to zero makes this more likely to happen.
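To make the idea concrete, here's a toy model of the oscillation I have in mind (made-up numbers and a drastically simplified weight function, not the actual BalancedShardsAllocator code):

public class OscillationSketch {

    static final double SHARD_BALANCE = 0.0;   // cluster.routing.allocation.balance.shard set to 0
    static final double DISK_BALANCE = 1e-9;   // some non-zero disk-usage balance factor (illustrative)
    static final double THRESHOLD = 1.01;

    // drastically simplified per-node weight: shard count and disk usage only
    static double weight(int shards, long diskBytes) {
        return SHARD_BALANCE * shards + DISK_BALANCE * diskBytes;
    }

    public static void main(String[] args) {
        long avgShardSize = 1L << 20;           // ~1 MiB: the average size used for the margin
        long outlierShard = 10L * (1L << 30);   // 10 GiB: the shard actually being moved
        double margin = THRESHOLD * DISK_BALANCE * avgShardSize;

        long diskA = outlierShard, diskB = 0;   // two nodes with identical shard counts
        for (int move = 1; move <= 4; move++) {
            double delta = Math.abs(weight(8, diskA) - weight(8, diskB));
            if (delta <= margin) {
                System.out.println("balanced after " + (move - 1) + " moves");
                return;
            }
            // moving the outlier shard to the lighter node merely flips the imbalance
            if (diskA > diskB) { diskA -= outlierShard; diskB += outlierShard; }
            else               { diskB -= outlierShard; diskA += outlierShard; }
            System.out.printf("move %d: weight delta %.4f still exceeds margin %.9f%n", move, delta, margin);
        }
        System.out.println("... and so on: the shard keeps bouncing between the two nodes");
    }
}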

Should I open an issue on GitHub, or would you want to look into it some more first and take care of it yourself?

Generally speaking, the desired balance allocator has been causing us trouble since we moved to 8.6, not just with this bug that appeared last week (e.g. difficulties reaching a balanced state, possibly improved or fixed once #93977 lands). So in the meantime we are probably better off switching back to the legacy balancer.
Feel free to close the discussion at this point, or let us know if you want any more information from us.

I think I'm missing something. AFAICT balanceByWeights() will always terminate because it visits each index exactly once, and each iteration of the inner loop will either bring minNode and maxNode closer together by at least 1, or else it discards a node-pair. But this stuff is tricky, and the code is old, and I've stared at it too much recently so maybe I'm blind to what you're seeing.

Moreover this code is shared between the legacy allocator and the desired-balance allocator so I would not expect switching allocators to help.

Yeah, as you said, the code is unfortunately tricky to decipher and reason about, so I'm not sure either, but it looks to me like L739: highIdx = relevantNodes - 1; actually resets minNode/maxNode to the full range.

That's a scary thought, if this bug could also be hit with the legacy allocator. But in that case wouldn't the cluster_concurrent_rebalance limit make it more likely that we break out of the loop before hitting this near-tie situation?

It does, but only after tryRelocateShard has succeeded in improving the balance of the current index by at least 1.

I don't think so: ConcurrentRebalanceAllocationDecider only returns YES or THROTTLE, so the balanced allocator moves (or at least simulates moving) the shard either way.

Oh, right: and THROTTLE isn't NO... Ouch.

What I was saying above is that the improvement from the shard relocation may in fact not be enough to stop the allocator from thinking it should move the shard back.
But as you made me look again, I realised that (what I think is) the bug is specifically in #93461, which we backported, and not in the latest official release. That's because the shard-size fallback used when adding/subtracting from node disk usage (clusterInfo.getShardSize(shardRouting, 0L)) may differ from the average returned by diskUsageInBytesPerShard() that is used to calculate the weight-delta threshold, whereas the 'forecasted size' is consistently the same number everywhere in the calculation.
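A minimal sketch of that mismatch (again a toy simulation with assumed numbers, not the real DesiredBalanceComputer):

public class SizeMismatchSketch {
    public static void main(String[] args) {
        // The decision "keep moving?" uses a margin derived from an average shard size,
        // while each simulated move adjusts node disk usage by the per-shard size, which
        // falls back to 0 bytes when unknown. When the two disagree, the gap never closes.
        long avgShardSizeBytes = 5L * (1L << 30);  // average used for the convergence margin (assumed)
        long fallbackMoveBytes = 0L;               // the 0L fallback applied when simulating a move

        long diskA = 40L * (1L << 30);             // simulated disk usage of two otherwise-equal nodes
        long diskB = 10L * (1L << 30);

        int rounds = 0;
        while (Math.abs(diskA - diskB) > avgShardSizeBytes && rounds < 5) {
            diskA -= fallbackMoveBytes;            // the allocator "moves" a shard from A to B...
            diskB += fallbackMoveBytes;            // ...but the simulated usage doesn't change at all
            rounds++;
        }
        System.out.println(rounds < 5
            ? "converged after " + rounds + " simulated rounds"
            : "still not converged after " + rounds + " rounds (and would keep iterating)");
    }
}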

Uh oh, well I guess it's now out in the wild...

Right, so I've been working on trying to reproduce the bug. As I already mentioned on the PR in question, the situation I worried about above does indeed prove problematic.

However, the symptom is different, as the master node does complain about "is still not converged": so it doesn't spin inside the balanceByWeights loop as I imagined, but it is indeed DesiredBalanceComputer.compute that fails to converge, as you thought might be happening.

On the other hand, the absence of logs in our initial problem doesn't mean it isn't the same bug: I also realised that the "make logging more intelligent" PR you linked above only made it into 8.7, not 8.6, so in our case that log message would only get printed every 10^6 iterations. In the small test cluster I used to reproduce this, that happens soon enough, but on our big stack with thousands of shards one iteration may well take a few milliseconds, so such a message wouldn't have been printed even in the roughly two hours the master node was stuck.
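Rough numbers behind that estimate (the per-iteration cost is just an assumption):

public class LogIntervalEstimate {
    public static void main(String[] args) {
        // With the 8.6 behaviour of logging only every 10^6 iterations, and an assumed
        // 8 ms per iteration on a large cluster, when would the first message appear?
        double assumedMsPerIteration = 8.0;
        double hours = 1_000_000 * assumedMsPerIteration / 1000.0 / 3600.0;
        System.out.printf("first 'not converged' message after roughly %.1f hours%n", hours);
        // ~2.2 hours, i.e. longer than the ~2h the master stayed stuck, so nothing got logged
    }
}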

Steps to reproduce this (a rough client-side sketch of steps 2-4 follows the list):

  1. Create a cluster (8.7.0) with 3 nodes
  2. Make an index with 2 shards
  3. Index at least 1GB or so of data into the shard with the highest id (using a fixed routing value)
  4. Trigger a rebalance calculation (e.g. with a dummy adjustment of the balance parameters)
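For what it's worth, here's a sketch of steps 2-4 using the low-level Java REST client (the index name, payload sizes, routing value and the setting nudged in step 4 are all placeholders; in particular, the routing value may need adjusting so that the data really lands on the shard with the highest id):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class LopsidedIndexRepro {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Step 2: an index with two primary shards
            Request create = new Request("PUT", "/repro-index");
            create.setJsonEntity("{\"settings\":{\"number_of_shards\":2}}");
            client.performRequest(create);

            // Step 3: pile data into a single shard by always using the same routing value
            StringBuilder bulk = new StringBuilder();
            String padding = "x".repeat(10_000);
            for (int i = 0; i < 100; i++) {
                bulk.append("{\"index\":{}}\n")
                    .append("{\"payload\":\"").append(padding).append("\"}\n");
            }
            for (int batch = 0; batch < 1_000; batch++) {   // roughly 1 GB in total; adjust as needed
                Request bulkRequest = new Request("POST", "/repro-index/_bulk");
                bulkRequest.addParameter("routing", "pin-to-one-shard");
                bulkRequest.setJsonEntity(bulk.toString());
                client.performRequest(bulkRequest);
            }

            // Step 4: a dummy adjustment of a balance parameter to trigger a fresh computation
            Request settings = new Request("PUT", "/_cluster/settings");
            settings.setJsonEntity("{\"persistent\":{\"cluster.routing.allocation.balance.threshold\":\"1.0001\"}}");
            client.performRequest(settings);
        }
    }
}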

Interesting. I can confirm that these steps do indeed reproduce the symptoms you describe, and the fundamental problem appears to relate to how disk-based balancing handles indices whose shard sizes vary wildly. In particular this affects both the legacy and desired balance allocators. I'm still analysing the problem but I have an idea for a fix.

Thank you so much for taking the time to distill it down to something so simple 🙂


Still not sure whether this was indeed the root cause of the original problem I raised here, but since I can't reproduce the exact same thing and we're no longer running the same configuration (so may not see it again), I'm marking this as resolved for now. We still did catch at least one bug! Thanks a lot for confirming and looking into it.

I opened #95090, which includes the reproduction you described as a test case, along with my proposed fix. With the default balancer weights it needs the shards to vary in size by several hundred GiB.

