Auto balancer not working in the midst of disk space crisis

test_tester · December 1, 2020, 8:45am

Hello everyone,

So in my cluster, I have 3 nodes, with 70%, 88% and 65% disk space usage.

As you might expect, my autobalancer is not working despite my configuration being set as 'enabled:all'.

Afraid that my 2nd node would 'burst', I am currently exploring some alternate options to balance it myself.

I have two solutions thus far:
(1) Manual Rebalancing (Cluster Reroute API)
(2) Exclude shards allocation (cluster.routing.allocation.exclude._ip)

Can anyone advise on manual rebalancing? If I simply move my shards from one node to another, will it impact my Elasticsearch indices in any way, such as data loss? For example, if I can see a document in ES Head that lives on node 2, and its shard moves to node 1, would I still be able to see that document?

Please also advise on excluding. From what I read in the documentation, it is commonly used when shutting down a node, but I don't intend to shut it down. By excluding a node, am I effectively shutting it down? If I have one cluster with only 2 nodes and I apply the exclusion, will everything in that cluster break?

Regards

DavidTurner · December 1, 2020, 8:52am

Rebalancing aims to balance the shard count across nodes, subject to various constraints. Although you don't say anything about shard counts here, I expect it's doing its job. Instead, you're looking for disk-based shard allocation which has already started to take effect (it won't be allocating any new shards to the 88% node) but won't start actively moving shards away until it gets to 90%. You can change the thresholds if you want.
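To make the two thresholds concrete, here is a minimal Python sketch of the behaviour described above, using the default 85% low and 90% high watermarks. `disk_state` is a hypothetical helper for illustration, not an Elasticsearch API, and the real allocator applies more rules than this.

```python
# Simplified model of the disk watermarks (percent of disk used).
LOW_WATERMARK = 85.0   # default low watermark
HIGH_WATERMARK = 90.0  # default high watermark


def disk_state(used_percent, low=LOW_WATERMARK, high=HIGH_WATERMARK):
    """Classify a node by disk usage (illustrative helper only)."""
    if used_percent > high:
        return "over high watermark: shards are moved away"
    if used_percent > low:
        return "over low watermark: no new shards allocated here"
    return "below low watermark: can receive shards"


# Disk usage reported in the original post.
for node, used in {"node1": 70.0, "node2": 88.0, "node3": 65.0}.items():
    print(node, used, disk_state(used))
```

With these numbers only node2 is over the low watermark, and no node is over the high one, so nothing is actively relocated yet.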

test_tester · December 1, 2020, 8:56am

So, in my company we don't really want to take chances by relying on it doing its job.

If we are set on going ahead with manually doing it ourselves, would you recommend the 2 approaches I mentioned?

DavidTurner · December 1, 2020, 9:02am

No, I would recommend pursuing neither of the suggestions in your OP.

The goal of disk-based shard allocation is to keep the disk usage below the high watermark, which defaults to 90%. If 88% disk usage is too high for your comfort then you should simply reduce the high watermark to a level at which you are happier. If you don't believe that Elasticsearch is doing as you ask then you can use GET _recovery to check up on it and make sure it's moving shards away to satisfy the watermarks.
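As a rough illustration of checking up on it, the Python sketch below picks the unfinished recoveries out of an already-parsed `GET _recovery` response. The `sample` dictionary is hand-written and heavily trimmed, not real cluster output. The index name and the `active_recoveries` helper are hypothetical, and the field names are assumed to follow the usual response shape (`shards`, `stage`, `source.name`, `target.name`).

```python
def active_recoveries(recovery_response):
    """Return (index, shard id, source node, target node, stage) for unfinished recoveries."""
    rows = []
    for index_name, info in recovery_response.items():
        for shard in info.get("shards", []):
            if shard.get("stage") != "DONE":
                rows.append((
                    index_name,
                    shard.get("id"),
                    shard.get("source", {}).get("name"),
                    shard.get("target", {}).get("name"),
                    shard.get("stage"),
                ))
    return rows


# Hand-written, trimmed sample (assumption, not real output).
sample = {
    "logs-example": {
        "shards": [
            {"id": 0, "type": "PEER", "stage": "INDEX",
             "source": {"name": "node2"}, "target": {"name": "node1"}},
            {"id": 1, "type": "PEER", "stage": "DONE",
             "source": {"name": "node2"}, "target": {"name": "node3"}},
        ]
    }
}

for row in active_recoveries(sample):
    print(row)
```

In this sample only shard 0 is still in flight, copying from node2 to node1.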

test_tester · December 1, 2020, 9:37am

Hi David, noted on your recommendation: it's simply better to just let it run on its own. Maybe I can even set the high watermark to 89% to prevent my node 2 from ever hitting 90%.

On another note, if I were to lower the high watermark to 80% right now, would disk-based allocation try to rebalance using 80% as the threshold?

Meaning my current disk usage would go from 70%, 88%, 65% to 74%, 74%, 74%?

DavidTurner · December 1, 2020, 9:45am

Yes. You'd also need to reduce the low watermark in that case, since the low watermark defaults to 85% and must always be set below the high watermark, but otherwise that seems like it will do what you want.

I wouldn't expect perfectly equal disk usages, no, but they will all end up under 80%.
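For reference, a settings change like this could be sketched in Python as below. The code only builds the JSON body that would be sent to `PUT _cluster/settings`; it does not contact a cluster. `watermark_settings` is a hypothetical helper, the 75% low value is an assumption chosen simply to sit below the 80% high watermark, and using the `persistent` section is also an assumption.

```python
import json


def watermark_settings(low_percent, high_percent):
    """Build a cluster settings body with both disk watermarks (illustrative helper)."""
    if low_percent >= high_percent:
        # The low watermark must always be below the high watermark.
        raise ValueError("low watermark must be below high watermark")
    return {
        "persistent": {
            "cluster.routing.allocation.disk.watermark.low": f"{low_percent}%",
            "cluster.routing.allocation.disk.watermark.high": f"{high_percent}%",
        }
    }


# Assumed values: high = 80% as discussed, low = 75% picked for illustration.
body = watermark_settings(75, 80)
print(json.dumps(body, indent=2))
```

Keeping the default 85% low watermark alongside an 80% high watermark is rejected by the helper, which mirrors the constraint mentioned above.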

test_tester · December 1, 2020, 9:45am

Hey David, thanks for the prompt response, really appreciate it. Will be testing it out!

Once again thanks for the help!

test_tester · December 1, 2020, 10:35am

Hi David,

As we discussed.

So I just tried out your suggestion on my test cluster and entered the following command in ES Head to change my low and high watermarks to 45% and 48% respectively.

Prior to the change, my disk usage was as follows: 53.87%, 53.93% and 45.95%.

The expected behaviour after running the query was that I would see node 3 inching upwards to 48% while nodes 1 and 2 would decrease towards 48%, balancing disk usage evenly across the nodes.

Can you advise if I am doing anything wrong?

DavidTurner · December 1, 2020, 11:07am

Yeah, that's too tight. Elasticsearch won't move shards onto a node that exceeds the low watermark, which is why it's stopped at 45%.
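The effect can be seen with a small Python sketch using the numbers from the test cluster. It is a simplification: `relocation_plan` is a hypothetical helper, and the real allocator also accounts for things like the size of the shard being moved.

```python
def relocation_plan(usage, low, high):
    """Split nodes into those that must shed shards and those that may accept them."""
    sources = [node for node, used in usage.items() if used > high]
    targets = [node for node, used in usage.items() if used < low]
    return sources, targets


# Disk usage from the test cluster, with watermarks of 45% (low) and 48% (high).
usage = {"node1": 53.87, "node2": 53.93, "node3": 45.95}
sources, targets = relocation_plan(usage, low=45.0, high=48.0)
print("over high watermark:", sources)
print("eligible targets:", targets)
```

Nodes 1 and 2 are over the high watermark, but node 3 at 45.95% is already over the 45% low watermark, so the list of eligible targets is empty and there is nowhere to move shards.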

system · December 29, 2020, 11:07am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
