"cluster_concurrent_rebalance" : "40" was ignored

"cluster_concurrent_rebalance" : "40" -> We had this as the setting, when I changed cluster.routing.allocation.awareness.attributes=rack_id
After this I see relocating_shards = 1000, and it was relocating almost all the shards in the cluster and took ingestion to ES slow for more than 8hours until relocating settled. Is this is expected behaviour when I had a top clause called cluster_concurrent_rebalance-> 40?

When you change cluster.routing.allocation.awareness.attributes, the resulting shard movements are not rebalancing, so they are not affected by the cluster_concurrent_rebalance setting.
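For reference, these movements are ordinary shard recoveries, so (as a rough sketch, assuming nothing else has been changed) they are limited by recovery-related settings such as cluster.routing.allocation.node_concurrent_recoveries rather than by cluster_concurrent_rebalance. You can watch them with something like:

curl -X GET "localhost:9200/_cluster/health?filter_path=relocating_shards&pretty"
curl -X GET "localhost:9200/_cat/recovery?active_only=true&v"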

You have almost certainly adjusted other settings related to concurrent shard relocations too.

Nope, I never touched any other settings. It looks like enabling cluster awareness ignores and overrides whatever cluster_concurrent_rebalance is set to. You may want to convert this into a bug report if, after simulating it, it turns out not to be intended behaviour.

How many nodes do you have in your cluster?

I have almost 110 data nodes.

Can you share the output of GET _nodes/_master/settings?flat_settings and GET _cluster/settings?flat_settings please?

GET _cluster/settings?flat_settings
{
  "persistent": {
    "cluster.routing.allocation.allow_rebalance": "always",
    "cluster.routing.allocation.awareness.attributes": "rack_id",
    "cluster.routing.allocation.balance.index": "0.8f",
    "cluster.routing.allocation.balance.shard": "0.2f",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "5",
    "cluster.routing.allocation.node_concurrent_recoveries": "40",
    "cluster.routing.rebalance.enable": "all",
    "indices.breaker.request.limit": "1%",
    "indices.breaker.total.limit": "60%",
    "indices.recovery.max_bytes_per_sec": "80mb"
  },
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "40",
    "cluster.routing.allocation.enable": "all",
    "indices.breaker.request.limit": "5%"
  }
}

GET _nodes/_master/settings?flat_settings

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "prd-rotaes6",
  "nodes": {
    "2GRUaHHSQjavUJAuSJwPIA": {
      "name": "prd-esmrota103.edu.et",
      "transport_address": "10.22.50.59:9300",
      "host": "prd-esmrota103.edu.et",
      "ip": "10.22.50.59",
      "version": "6.8.10",
      "build_flavor": "default",
      "build_type": "deb",
      "build_hash": "537cb22",
      "roles": [
        "master",
        "ingest"
      ],
      "attributes": {
        "ml.machine_memory": "16826671104",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "settings": {
        "action.destructive_requires_name": "true",
        "bootstrap.memory_lock": "true",
        "client.type": "node",
        "cluster.name": "prd-rotaes6",
        "cluster.routing.allocation.disk.threshold_enabled": "true",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.same_shard.host": "true",
        "discovery.zen.minimum_master_nodes": "2",
        "discovery.zen.ping.unicast.hosts": [
          "prd-esmrota101.edu.et",
          "prd-esmrota102.edu.et",
          "prd-esmrota103.edu.et"
        ],
        "gateway.recover_after_data_nodes": "3",
        "gateway.recover_after_master_nodes": "2",
        "gateway.recover_after_time": "2m",
        "http.type": "security4",
        "http.type.default": "netty4",
        "indices.breaker.total.limit": "20%",
        "indices.fielddata.cache.size": "5%",
        "indices.queries.cache.size": "10%",
        "network.bind_host": "0.0.0.0",
        "network.publish_host": "prd-esmrota103.edu.et",
        "node.attr.ml.enabled": "true",
        "node.attr.ml.machine_memory": "16826671104",
        "node.attr.ml.max_open_jobs": "20",
        "node.attr.xpack.installed": "true",
        "node.data": "false",
        "node.master": "true",
        "node.name": "prd-esmrota103.edu.et",
        "path.data": [
          "/var/lib/elasticsearch"
        ],
        "path.home": "/usr/share/elasticsearch",
        "path.logs": "/var/log/elasticsearch",
        "pidfile": "/var/run/elasticsearch/elasticsearch.pid",
        "transport.features.x-pack": "true",
        "transport.type": "security4",
        "transport.type.default": "netty4"
      }
    }
  }
}

This tells us that you have indeed adjusted other settings related to concurrent shard relocations too, in particular:

"cluster.routing.allocation.node_concurrent_recoveries": "40"

Set that back to the default.

This is the intended behaviour btw, as I said above shard movement related to allocation awareness is quite distinct from rebalancing, so cluster_concurrent_rebalance has no effect.

Are you asking us not to set anything in node_concurrent_recoveries? Or to set it to null via the API now?

Yes, that's correct. Setting it to null will remove it from the cluster settings and restore the default value.
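For example, something like this should do it (illustrative only, and assuming the setting lives in the persistent settings as your output above shows):

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null
  }
}
'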

Is node_concurrent_recoveries × cluster_concurrent_rebalance the maximum number of relocations?

No, once again I will reiterate that cluster_concurrent_rebalance has no effect here.

"cluster.routing.allocation.cluster_concurrent_rebalance": "40",
"cluster.routing.allocation.node_concurrent_recoveries": null,
1. If I had had my cluster with the settings above, would it not have run into 1000 relocations, and stayed at just 40 the whole time?

2. Let's say it has already initiated 1000 relocations and I then apply "cluster.routing.allocation.node_concurrent_recoveries": 2. Will it immediately reduce from 1000 to 2 × 110 nodes = 220?

3. I turned on rack awareness with the command below.

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "rack_id"
  }
}
'

As soon as I noticed that relocation had started, if I had rolled back this change by setting it to null as below, would the relocation have stopped immediately, or would it have continued?

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": null
  }
}
'

The default is to allow two recoveries (incoming or outgoing) per node. Since you have ~110 data nodes you will not have more than 110 concurrent recoveries with the default settings.
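Roughly: 110 data nodes × 2 recovery slots each ≈ 220 slots, and each relocation occupies a slot on both the source node and the target node, so at most around 220 / 2 = 110 relocations can be in flight at once.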

I also recommend setting cluster_concurrent_rebalance back to the default. 40 is extremely high and may mean that your cluster takes a very long time (possibly weeks) to fully stabilise.
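As an illustration only (and noting from your output above that this setting appears in both the persistent and the transient settings, so both would need to be cleared):

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  },
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": null
  }
}
'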

These settings do not take effect immediately, only when starting a new recovery. Similarly if you remove all awareness attributes this will not cancel any recoveries but it will not trigger any new recoveries.

Got it. Thanks, David!

Dumb question - why would setting concurrent rebalancing lower shorten time to stability? I'd think the opposite.

AFAIK: because rebalancing is supposed to happen in the background without much network activity or CPU work, at a rate the cluster can afford without hindering the performance of the application's existing reads and writes. When I hit 1000 relocations the cluster was almost inaccessible, as the app's write latency went up to minutes. As soon as the cluster started to stabilise, with around 80 relocations in progress, the POST calls came back to normal, in milliseconds. We need to choose what fits our environment by looking at how long it takes to rebalance a given number of shards of a given size, then pick night hours / non-peak hours as a maintenance window and increase the concurrent rebalance value to get the rebalance done faster, if that is acceptable to users.

Not a dumb question at all, and the honest answer is that I don't exactly know. The way Elasticsearch computes the shards to move to improve the balance of the cluster is pretty complicated and very heavily tuned for the default settings. If you permit too much concurrent rebalancing then it effectively looks too many moves into the future and that amount of speculation seems to lead to some poor decisions.

We could investigate this, it sounds like a possible bug, but it's a very long way from the top of our list given that it all behaves as expected with the default settings and that adjusting these settings is almost always the wrong thing to do anyway.

Thanks (though I thought the balancer was more real-time, with a set of current weights, etc.; I wasn't aware it looks into the future; now I have to go look at the code, or I'll wonder forever :wink:)

Hmm, BalancedShardsAllocator.java is 1200 lines of fairly complex allocation code (and more loops/sorts than I'd have expected), and I see you've been simplifying it :slight_smile: which makes you the expert :wink:

Though I don't see much in the way of looking ahead or costing other than the index/shard weight deltas; throttling is sprinkled around, though, and presumably has implicit effects.

TBF ~400 of those lines are comments or blank, but yeah as I said it's pretty complicated and heavily tuned :wink:

The lookahead is embedded in the fact that it runs while relocations are already ongoing and pretends that they've all completed, and treats new relocations similarly. IIRC it also assumes any relocations that can't currently occur due to throttling will eventually be unthrottled and complete.

