Shard rebalance issue

Running ES v7.2.0.

We excluded data from node 'es-dbs-022' using cluster.routing.allocation.exclude._name for unrelated reasons. After some time we returned the node to the cluster using the same setting (resetting its value to null), and since then the cluster has been rebalancing for a few days (it usually finishes in ~30 minutes), and the rebalancing itself is behaving very strangely.
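For reference, the exclude/include cycle described above looks roughly like this (a sketch of the cluster settings calls; the node name is taken from this thread):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "es-dbs-022"
  }
}

// later, to bring the node back into allocation:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": null
  }
}
```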

We have 60 data nodes, and one specific index has 60 primary shards (+60 replicas). Nodes usually hold 1-3 shards of this index, but es-dbs-022 now has ~30 of them. Beyond that, the node holds only a few shards from other indices.

This is very abnormal; usually after a few hours of rebalancing the node's disk fills up and it blocks all writes. I tried looking through the logs but found nothing of importance.

I've set the allocator's log level to TRACE, and the only useful thing I found is that, according to BalancedShardsAllocator, this node has a negative weight in half of the log records. No other node has a negative weight.
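Enabling that TRACE output can be done dynamically via the logger settings; a sketch, assuming you want the whole allocator package (remember to set it back to null afterwards, since TRACE is very noisy):

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.routing.allocation.allocator": "TRACE"
  }
}
```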

One other potential clue is this 'THROTTLE' log line (it happens often; bfcIx0i_Spm_aatNSspIkg is es-dbs-022):
{"log":"{\"type\": \"server\", \"timestamp\": \"2019-07-22T13:16:12,653+0000\", \"level\": \"TRACE\", \"component\": \"o.e.c.r.a.a.BalancedShardsAllocator\", \"cluster.name\": \"es-research-cloud\", \"node.name\": \"es-dbm-001\", \"cluster.uuid\": \"Tnbn6gyVRUWU4p-m--4gIA\", \"node.id\": \"ATaVaYN6QZePAa5s5IMhsQ\", \"message\": \"Couldn't find shard to relocate from node [uDxmj5r8QEi-gvdSwu0HQw] to node [bfcIx0i_Spm_aatNSspIkg] allocation decision [THROTTLE]\" }\n","stream":"stdout","time":"2019-07-22T13:16:12.653302601Z"}

Does anybody have any ideas? Thanks!

This looks like a known allocator issue. Try setting index.routing.allocation.total_shards_per_node on this index to prevent the shards from becoming so concentrated.
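A sketch of what that might look like for the index in question (the index name and the limit are placeholders; with 60 primaries + 60 replicas spread over 60 nodes, the average is 2 shard copies per node, so a ceiling of 2 or 3 seems reasonable):

```
PUT my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
```

Note this is a hard limit: if it is set too low, some shards may be left unassigned.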


At first I used index.routing.allocation.exclude on this index and a few other big ones, which fixed the issue. Then we added another node to the cluster and the same issue happened; this time I used your index.routing.allocation.total_shards_per_node suggestion, and it also worked.

The same issue is now causing more problems: if I try to bulk-export data into the cluster, all shards of the newly created index end up on the same node, and the node dies from the write overload.

I guess we can work around this by setting total_shards_per_node in our index templates, but it is really frustrating. If this issue gets fixed, will the fix be backported to the 7.2.x series?
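For anyone applying the same workaround: a sketch of the template-based version, using the legacy _template API available in 7.2 (the template name and index pattern are placeholders):

```
PUT _template/batch-indices
{
  "index_patterns": ["batch-*"],
  "settings": {
    "index.routing.allocation.total_shards_per_node": 2
  }
}
```

Any index created after this whose name matches the pattern will get the setting automatically, so newly created bulk-export indices are covered without per-index calls.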

One more question: would increasing cluster.routing.allocation.balance.index be a good idea to try? Thanks!
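If anyone wants to experiment with this, a sketch of the call (the value 0.7 is an arbitrary example; the default for balance.index is 0.55, and raising it makes the balancer weight per-index shard counts more heavily relative to total shard counts):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.balance.index": 0.7
  }
}
```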

It's almost certain that this will not be fixed in the 7.2.x series: 7.2.0 has already been released, so there will be no new features in that series, and addressing this is a very substantial piece of work.

You can certainly try it, but if you want to tell Elasticsearch "do not allocate more than N shards of this index to any one node, or else that node will fall over", then total_shards_per_node is the way to do that.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.