Shard Rebalancing on large clusters (2.4)

eperry · December 14, 2016, 2:18am

Hey guys, I've got a question.

I have a cluster with over 20k total shards (each index has 30 shards + 1 replica and is about 300GB) on 18 data nodes with ~24 cores each. We are also indexing 10K messages per second all day long (about 1TB of data a day).

When I _open a couple of indexes at a time, the cluster rebalances for a while.
When doing maintenance on a node, it takes forever for the cluster to rebalance.

While the re-routing/rebalancing/recovery is happening, my indexing slows way down.

So here are my questions:

I know there are heuristics for when the cluster chooses to rebalance, but I don't understand the meaning of the numbers, so I am afraid to touch them. Are there any resources that describe this better (or should I look somewhere else)?

https://www.elastic.co/guide/en/elasticsearch/reference/master/shards-allocation.html#_shard_balancing_heuristics
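For readers with the same question, here is a rough sketch of how the three balancing numbers interact. It assumes the documented defaults (`cluster.routing.allocation.balance.shard` 0.45, `cluster.routing.allocation.balance.index` 0.55, `cluster.routing.allocation.balance.threshold` 1.0). It is a simplified model of the weight calculation, not the actual allocator code, and the function names and the three-node example are made up for illustration.

```python
def node_weight(node_shards, avg_shards, node_index_shards, avg_index_shards,
                shard_factor=0.45, index_factor=0.55):
    """Simplified per-node weight for one index (illustrative model only).

    A positive weight means the node holds more than its fair share.
    """
    total = shard_factor + index_factor
    theta_shard = shard_factor / total   # pull towards equal total shards per node
    theta_index = index_factor / total   # pull towards spreading each index evenly
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards))


def should_rebalance(weights, threshold=1.0):
    """Rebalance only if the heaviest and lightest node differ by more than the threshold."""
    return max(weights) - min(weights) > threshold


# Hypothetical three-node cluster: total shards per node, and shards of one index per node.
total_per_node = [12, 10, 8]
index_per_node = [4, 3, 2]
avg_total = sum(total_per_node) / len(total_per_node)
avg_index = sum(index_per_node) / len(index_per_node)

weights = [node_weight(t, avg_total, i, avg_index)
           for t, i in zip(total_per_node, index_per_node)]
balanced = [node_weight(10, 10, 3, 3) for _ in range(3)]

print(weights, should_rebalance(weights))
print(balanced, should_rebalance(balanced))
```

In this model, raising the threshold makes the cluster tolerate a larger imbalance before it moves shards, while the ratio between the shard and index factors decides whether overall shard count or per-index spread matters more.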

I have looked at the thread queues but don't see any threads being maxed out during the rebalancing. I have also played with the concurrent rebalancing settings at the cluster level, but doing it slow (concurrent rebalance of 2 or 3) or fast (30+) seems to have the same impact.

https://www.elastic.co/guide/en/elasticsearch/reference/master/shards-allocation.html#_shard_allocation_settings

eperry · December 14, 2016, 2:20am

curl -XPUT $HOSTNAME:9200/_cluster/settings -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "5000mb",
    "indices.recovery.concurrent_streams": 8,
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 4,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 8,
    "index.unassigned.node_left.delayed_timeout": "1m",
    "index.refresh_interval": "5s",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always"
  }
}'
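One thing that may be worth checking in the request above: the two keys that start with `index.` are per-index settings, which the Elasticsearch documentation describes as being updated through the index `_settings` endpoint rather than `_cluster/settings`. A minimal sketch that splits the payload by prefix follows. The helper name `split_settings` is hypothetical, and the endpoints named in the comments should be verified against the 2.4 documentation before use.

```python
import json

# Settings copied from the request above.
settings = {
    "indices.recovery.max_bytes_per_sec": "5000mb",
    "indices.recovery.concurrent_streams": 8,
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 4,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 8,
    "index.unassigned.node_left.delayed_timeout": "1m",
    "index.refresh_interval": "5s",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.allow_rebalance": "always",
}


def split_settings(flat):
    """Separate keys prefixed with 'index.' (per-index) from the rest (cluster-wide)."""
    index_level = {k: v for k, v in flat.items() if k.startswith("index.")}
    cluster_level = {k: v for k, v in flat.items() if not k.startswith("index.")}
    return cluster_level, index_level


cluster_level, index_level = split_settings(settings)

# Body intended for PUT /_cluster/settings
print(json.dumps({"transient": cluster_level}, indent=2))

# Body intended for PUT /_all/_settings (or a specific index name instead of _all)
print(json.dumps(index_level, indent=2))
```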

warkolm · December 14, 2016, 2:24am

Is the index 300GB, or the shard?

eperry · December 14, 2016, 2:28am

The index. It ranges from 100GB to 300GB depending on the index (~5 different indexes adding up to the 1TB a day).

The shard should be about 10GB for the 300GB index (300GB / 30 shards = 10GB), right?
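A quick check of that arithmetic, as a sketch: the variable names are illustrative, and it assumes the 300GB figure counts primary shards only (if it includes the replica copies, each shard would be about half that size).

```python
# Figures taken from the thread; variable names are illustrative.
index_size_gb = 300             # largest daily index
primary_shards = 30             # primaries per index
replicas = 1                    # one replica copy of each primary
total_cluster_shards = 20_000   # "over 20k" in the original post
data_nodes = 18

# Assumption: the 300GB figure covers primary shards only.
shard_size_gb = index_size_gb / primary_shards
shard_copies_per_index = primary_shards * (1 + replicas)
shards_per_node = total_cluster_shards / data_nodes

print(f"size per primary shard: {shard_size_gb:.0f} GB")
print(f"shard copies per index: {shard_copies_per_index}")
print(f"shard copies per data node: {shards_per_node:.0f}")
```

With these numbers each data node carries more than a thousand shard copies, which is the figure that drives how much work a node restart or rebalance has to do.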

We are thinking about adding shards because we are adding hosts.

We are OK with slow searches but can't deal with a backlog of indexing during recovery.

eperry · December 22, 2016, 5:32pm

No ideas?

Well, I did find one issue: I was CPU/IO bound. I added 4 more LUNs and went from 50Mbps to 500Mbps, so that will help once I finish splitting the 15 nodes.

system · January 19, 2017, 5:33pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
