"cluster_concurrent_rebalance" : "40" was ignored

"cluster_concurrent_rebalance" : "40" -> We had this as the setting, when I changed cluster.routing.allocation.awareness.attributes=rack_id
After this I see relocating_shards = 1000, and it was relocating almost all the shards in the cluster and took ingestion to ES slow for more than 8hours until relocating settled. Is this is expected behaviour when I had a top clause called cluster_concurrent_rebalance-> 40?

When you change cluster.routing.allocation.awareness.attributes, the resulting shard movements are not rebalancing, so they are not affected by the cluster_concurrent_rebalance setting.
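For reference, these movements are ordinary shard recoveries, so (as a rough sketch, assuming nothing else has been changed) they are limited by recovery-related settings such as cluster.routing.allocation.node_concurrent_recoveries rather than by cluster_concurrent_rebalance. You can watch them with something like:

curl -X GET "localhost:9200/_cluster/health?filter_path=relocating_shards&pretty"
curl -X GET "localhost:9200/_cat/recovery?active_only=true&v"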

You have almost certainly adjusted other settings related to concurrent shard relocations too.

Nope, I never touched any other settings. It looks like enabling cluster awareness ignores and overrides whatever cluster_concurrent_rebalance is set to. You may want to convert this into a bug report if, after simulating it, it turns out not to be intended behaviour.

How many nodes do you have in your cluster?

I have almost 110 data nodes.

Can you share the output of GET _nodes/_master/settings?flat_settings and GET _cluster/settings?flat_settings please?

GET _cluster/settings?flat_settings
{
  "persistent": {
    "cluster.routing.allocation.allow_rebalance": "always",
    "cluster.routing.allocation.awareness.attributes": "rack_id",
    "cluster.routing.allocation.balance.index": "0.8f",
    "cluster.routing.allocation.balance.shard": "0.2f",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "5",
    "cluster.routing.allocation.node_concurrent_recoveries": "40",
    "cluster.routing.rebalance.enable": "all",
    "indices.breaker.request.limit": "1%",
    "indices.breaker.total.limit": "60%",
    "indices.recovery.max_bytes_per_sec": "80mb"
  },
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "40",
    "cluster.routing.allocation.enable": "all",
    "indices.breaker.request.limit": "5%"
  }
}

GET _nodes/_master/settings?flat_settings

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "prd-rotaes6",
  "nodes": {
    "2GRUaHHSQjavUJAuSJwPIA": {
      "name": "prd-esmrota103.edu.et",
      "transport_address": "10.22.50.59:9300",
      "host": "prd-esmrota103.edu.et",
      "ip": "10.22.50.59",
      "version": "6.8.10",
      "build_flavor": "default",
      "build_type": "deb",
      "build_hash": "537cb22",
      "roles": [
        "master",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "16826671104",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "settings": {
        "action.destructive_requires_name": "true",
        "bootstrap.memory_lock": "true",
        "client.type": "node",
        "cluster.name": "prd-rotaes6",
        "cluster.routing.allocation.disk.threshold_enabled": "true",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.same_shard.host": "true",
        "discovery.zen.minimum_master_nodes": "2",
        "discovery.zen.ping.unicast.hosts": [
          "prd-esmrota101.edu.et",
          "prd-esmrota102.edu.et",
          "prd-esmrota103.edu.et"
        ],
        "gateway.recover_after_data_nodes": "3",
        "gateway.recover_after_master_nodes": "2",
        "gateway.recover_after_time": "2m",
        "http.type": "security4",
        "http.type.default": "netty4",
        "indices.breaker.total.limit": "20%",
        "indices.fielddata.cache.size": "5%",
        "indices.queries.cache.size": "10%",
        "network.bind_host": "0.0.0.0",
        "network.publish_host": "prd-esmrota103.edu.et",
        "node.attr.ml.enabled": "true",
        "node.attr.ml.machine_memory": "16826671104",
        "node.attr.ml.max_open_jobs": "20",
        "node.attr.xpack.installed": "true",
        "node.data": "false",
        "node.master": "true",
        "node.name": "prd-esmrota103.edu.et",
        "path.data": [
          "/var/lib/elasticsearch"
        ],
        "path.home": "/usr/share/elasticsearch",
        "path.logs": "/var/log/elasticsearch",
        "pidfile": "/var/run/elasticsearch/elasticsearch.pid",
        "transport.features.x-pack": "true",
        "transport.type": "security4",
        "transport.type.default": "netty4"
      }
    }
  }
}

This tells us that you have indeed adjusted other settings related to concurrent shard relocations too, in particular:

"cluster.routing.allocation.node_concurrent_recoveries": "40"

Set that back to the default.

This is the intended behaviour btw, as I said above shard movement related to allocation awareness is quite distinct from rebalancing, so cluster_concurrent_rebalance has no effect.

Are you asking us not to set anything in node_concurrent_recoveries? Or to set it to null via the API now?

Yes, that's correct. Setting it to null will remove it from the cluster settings and restore the default value.
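For example, something like this should do it (illustrative only, and assuming the setting lives in the persistent settings as your output above shows):

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}
'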

Is node_concurrent_recoveries × cluster_concurrent_rebalance the maximum number of relocations?

No, once again I will reiterate that cluster_concurrent_rebalance has no effect here.

"cluster.routing.allocation.cluster_concurrent_rebalance": "40",
"cluster.routing.allocation.node_concurrent_recoveries": null,
1. If I had had my cluster with the settings above, would it not have run into 1000 relocations, and stayed at just 40 the whole time?

2. Let's say it has already initiated 1000 relocations and I then apply "cluster.routing.allocation.node_concurrent_recoveries": 2. Will it immediately reduce from 1000 to 2 × 110 nodes = 220?

3. I turned on rack awareness with the command below.

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack_id"
  }
}
'

As soon as I noticed that relocation had started, if I had rolled back this change by setting it to null as below, would the relocation have stopped immediately, or would it have continued?

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": null
  }
}
'

The default is to allow two recoveries (incoming or outgoing) per node. Since you have ~110 data nodes you will not have more than 110 concurrent recoveries with the default settings.
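Roughly: 110 data nodes × 2 recovery slots each ≈ 220 slots, and each relocation occupies a slot on both the source node and the target node, so at most around 220 / 2 = 110 relocations can be in flight at once.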

I also recommend setting cluster_concurrent_rebalance back to the default. 40 is extremely high and may mean that your cluster takes a very long time (possibly weeks) to fully stabilise.
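As an illustration only (and noting from your output above that this setting appears in both the persistent and the transient settings, so both would need to be cleared):

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  },
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  }
}
'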

These settings do not take effect immediately, only when starting a new recovery. Similarly if you remove all awareness attributes this will not cancel any recoveries but it will not trigger any new recoveries.

Got it. Thanks, David!

Dumb question - why would setting concurrent rebalancing lower shorten time to stability? I'd think the opposite.

AFAIK: because rebalancing is supposed to happen in the background without much network activity or CPU work, at a rate the cluster can afford without hindering the performance of the application's existing reads and writes. When I hit 1000 relocations the cluster was almost inaccessible, as the app's write latency went up to minutes. As soon as the cluster started to stabilise, with around 80 relocations in progress, the POST calls came back to normal, in milliseconds. We need to choose what fits our environment by looking at how long it takes to rebalance a given number of shards of a given size, then pick night hours / non-peak hours as a maintenance window and increase the concurrent rebalance value to get the rebalance done faster, if that is acceptable to users.

Not a dumb question at all, and the honest answer is that I don't exactly know. The way Elasticsearch computes the shards to move to improve the balance of the cluster is pretty complicated and very heavily tuned for the default settings. If you permit too much concurrent rebalancing then it effectively looks too many moves into the future and that amount of speculation seems to lead to some poor decisions.

We could investigate this, it sounds like a possible bug, but it's a very long way from the top of our list given that it all behaves as expected with the default settings and that adjusting these settings is almost always the wrong thing to do anyway.

Thanks (though I thought the balancer was more real-time, with a set of current weights, etc.; I wasn't aware it looks into the future; now I have to go look at the code, or I'll wonder forever :wink:)

Hmm, BalancedShardsAllocator.java is 1200 lines of fairly complex allocation code (and more loops/sorts than I'd have expected), and I see you've been simplifying it :slight_smile: which makes you the expert :wink:

Though I don't see much in the way of looking ahead or costing other than the index/shard weight deltas; throttling is sprinkled around, though, and presumably has implicit effects.

TBF ~400 of those lines are comments or blank, but yeah as I said it's pretty complicated and heavily tuned :wink:

The lookahead is embedded in the fact that it runs while relocations are already ongoing and pretends that they've all completed, and treats new relocations similarly. IIRC it also assumes any relocations that can't currently occur due to throttling will eventually be unthrottled and complete.

