ES constantly rebalancing after restart


I did a rolling restart (freezing the cluster, then restarting), but afterwards it has been rebalancing for over 24 hours, whereas it was stable prior to the restart.

Some details:

I have 9 data nodes with about 3000 shards (6 indexes a day), over 800GB of data, receiving 10K new messages per second (plus 3 master and 2 client nodes).

Heap size is 30GB, and each server has 16 to 24 CPUs, 90GB RAM, a 10Gb network, and a 4TB LUN spread over 40+ disks.

So I am wondering if I should tweak the rebalancing threshold; I have already tweaked the recovery speed:

{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "8000mb",
    "indices.recovery.concurrent_streams": 5,
    "cluster.routing.allocation.node_concurrent_recoveries": 6,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 20,
    "index.unassigned.node_left.delayed_timeout": "30m",
    "cluster.routing.allocation.allow_rebalance": "indices_all_active"
  }
}
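For what it's worth, the balancer's sensitivity is controlled by `cluster.routing.allocation.balance.threshold` (default 1.0); raising it makes the balancer less eager to move shards around. A sketch of how that could be set, assuming a node reachable at localhost:9200:

```shell
# Raise the rebalance threshold so only clearly unbalanced nodes
# trigger shard movement (value 3.0 is an illustrative guess, not
# a recommendation; tune for your cluster).
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.balance.threshold": 3.0
  }
}'
```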

My freeze looks like:

{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
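(For completeness, the usual counterpart once the restarted node has rejoined is to re-enable allocation, e.g.:)

```shell
# Re-enable shard allocation after the rolling restart step completes.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'
```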

My load average is about 3, and IO wait is constant at around 5%, which matches what I see during stable periods, and the IO queue depth is never more than 1 or 2, so it does not look like an IO issue.

Any thoughts?

In our cluster, cluster.routing.allocation is always disabled.
When restarting a node, I manually allocate the shards back to the original node.

The reason is I found that after enabling reallocation, some shards get allocated to other nodes rather than the original (restarting) one.
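For reference, manual allocation on ES 1.x can be done through the `_cluster/reroute` API's `allocate` command; the index and node names below are hypothetical, for illustration only:

```shell
# Pin an unassigned shard back to a specific node (ES 1.x reroute API).
# "logs-2016.01.01" and "node-1" are placeholder names.
curl -XPOST 'localhost:9200/_cluster/reroute' -d '{
  "commands": [
    {
      "allocate": {
        "index": "logs-2016.01.01",
        "shard": 0,
        "node": "node-1",
        "allow_primary": false
      }
    }
  ]
}'
```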

Is this the same situation for you?

Why would you do that? Who cares where the shards are?

If the shard is allocated to a different node, the whole shard needs to be recovered;
i.e. if the shard size is 1TB, then 1TB of data is recovered from the primary shard.
That's too slow for me, especially during a rolling restart.

And if I allocate it back to the original node, recovery is fast: only a small part of the shard needs to be recovered.

Then reduce your shard size. 1TB is way too big.

What version are you on?

No, my shards are more like 10 to 20GB. Manually rebalancing them is an interesting idea, but I think that is too much work, IMHO.

When I tried a higher cluster_concurrent_rebalance, the system became pretty bogged down.

@warkolm, maybe I should think of it in a different way: it is not the number of shards recovering that is my problem but the size of them? I mean my indexes are at 24 shards, and I'm not really seeing any CPU problems. Should I maybe go to 50 or 100? Of course this is a production cluster, so I would have to set up some kind of test, but what do you think?

I agree, shards at 1TB are way too large. Your search times must be painfully slow @chenjinyuan87

We have 4TB of data, and the _all field is already disabled. I don't know what else I can do.

Currently it's 1.3.4, and I'm planning to upgrade to 5.0,
but I need to upgrade my custom plugins first.

@eperry @warkolm
In my opinion, too many shards also slows down search.
So I think 5 shards is a good choice...

I think ES should have a better reallocation strategy that makes sure shards are allocated back to the original node.