Elastic refuses to balance disks, trying to send data to full cold disks, why?

We found a couple of weeks ago that our 3-node cold storage tier was filling up; it got to ~90% full. We added 3 more cold nodes around that time, before it got past ~90%. After doing so, Elasticsearch slowly started moving data from the old cold nodes to the new cold nodes (I think). All I know is that the storage used on the old cold nodes went down, and new data was being sent to the new cold nodes, I think.
Something happened this week and Elasticsearch started sending most of the data to the old cold storage disks again.
Elasticsearch is still trying to send data based on ILM to both the old and the new cold disks. So the new disks are getting the ILM-based warm data, but the old cold disks are too.

  • Why is Elasticsearch trying to send data to the full disks? Wouldn't it recognize that those disks are full and only send to the new, not-full ones?
  • Why is Elasticsearch not balancing data across disks?
  • Are we losing data when Elasticsearch tries to send to the old disks but finds they are full? Does it drop the data?

What can I look up to share with you all? Can anyone help me troubleshoot this?

By disks you mean nodes, right?

Can you share the elasticsearch.yml of those nodes?

Yes. These nodes are run as ECK pods, and I do not have access to the elasticsearch.yml file.
Note: we have no cluster watermark set. Will Elasticsearch not try to move data if there is no watermark?
I know that if I set a low watermark, Elasticsearch will avoid sending data to a node above it, and with a high watermark it will both avoid sending data to the node and try to move data off of it if it can.
But if there is no watermark, does Elasticsearch not do this at all?
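For reference, here is a minimal sketch of how the watermarks could be set explicitly via the cluster settings API, using Python with requests. The endpoint URL and the threshold values are assumptions for illustration, not our actual configuration:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: replace with your ECK cluster endpoint and credentials

# Hypothetical threshold values for illustration; if nothing is set, built-in defaults apply.
settings = {
    "persistent": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    }
}

resp = requests.put(f"{ES_URL}/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())  # should report "acknowledged": true and echo the applied settings
```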

It depends on how large your disks are: 90% may not be full enough for Elasticsearch to consider avoiding allocating shards there. It's trying to balance lots of different things, not just disk usage.
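If you want to see the actual per-node disk usage the balancer is working with, a quick sketch like this (assuming an unauthenticated cluster reachable at http://localhost:9200) prints the _cat/allocation view:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

# _cat/allocation lists shard count, disk used, disk available and disk percent per node.
resp = requests.get(
    f"{ES_URL}/_cat/allocation",
    params={"v": "true", "h": "node,shards,disk.used,disk.avail,disk.percent"},
)
resp.raise_for_status()
print(resp.text)
```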

In any case whenever you have questions about shard allocation the first thing to try is cluster allocation explain. This will tell you why a shard is where it is. If you think a shard shouldn't be on one of the cold nodes, e.g. because you think the node is too full, then use that API to explain it.
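As a sketch of what that looks like in Python (the index name, shard number and primary flag below are hypothetical placeholders, and the URL assumes a locally reachable cluster):

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

# Ask why one specific shard copy is where it is.
# "my-cold-index", shard 0 and primary=True are placeholders; use a shard you care about.
body = {"index": "my-cold-index", "shard": 0, "primary": True}

resp = requests.get(f"{ES_URL}/_cluster/allocation/explain", json=body)
resp.raise_for_status()
explanation = resp.json()

print("current node:", explanation.get("current_node", {}).get("name"))
# When allocation is not what you expect, the response includes per-node decider
# output (e.g. the disk threshold decider) under node_allocation_decisions.
for decision in explanation.get("node_allocation_decisions", []):
    print(decision.get("node_name"), "->", decision.get("node_decision"))
```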

To be clear, do you mean you don't know how to get the nodes' elasticsearch.yml file, or it's some other team's job in your organization, or you simply have no access, or ... ?

The reason I ask is that in the original post there are a few "I thinks", "all I know ..", "something happened", ... which I find a little worrying. It suggests you have limited access to the system? Are you getting information secondhand and just don't/can't know the details/specifics? If so, that would make troubleshooting tough going forward, IMHO. Or maybe it's just your writing style?

On the specifics, "we have no cluster watermark set" is not how I would word it. I'd suggest that what you mean is you did not override the default settings. Maybe it's just semantics, but the watermark/flood-stage/disk-space settings do exist and are active; they just haven't been explicitly set by you.

The docs say, e.g., that cluster.routing.allocation.disk.watermark.low defaults to 85%; that and many other settings/defaults are set out in the docs, e.g. at Cluster-level shard allocation and routing settings | Elastic Documentation.
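To confirm what the effective values are on a cluster where nothing has been overridden, a sketch like this (same localhost assumption as above) dumps the disk-allocation settings including the built-in defaults:

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: adjust for your cluster

# include_defaults=true returns the built-in values even if you never set them explicitly.
resp = requests.get(
    f"{ES_URL}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
)
resp.raise_for_status()
data = resp.json()

# Explicit persistent/transient settings take precedence over the defaults.
merged = {**data.get("defaults", {}), **data.get("persistent", {}), **data.get("transient", {})}
for key, value in sorted(merged.items()):
    if key.startswith("cluster.routing.allocation.disk"):
        print(f"{key} = {value}")
```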