Hello, I'm using ElasticStack 7.6.1 on CentOS 8.
I recently upgraded the kernel on all three of our nodes and rebooted them. After the reboots the cluster decided to reallocate some shards, but for some reason it placed many large shards on one node, to the extent that the node ran out of disk space and the cluster became inoperable. I was able to fix it by simply adding more space to that node and rebooting all the nodes again, after which the cluster evenly rebalanced the shards (and, roughly, the disk usage) across the nodes.
I have a couple of questions regarding this though:
Shouldn't the disk-based high and low watermark settings have stopped that node's disk from becoming full, i.e. shouldn't the cluster have stopped moving shards to the node once its watermarks had been reached? (I haven't changed these settings, so they're at the defaults.)
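For reference, these are the documented 7.x defaults for the disk-based allocation deciders; it's worth confirming what your cluster actually sees with the settings API:

```
GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.routing.allocation.disk

# Documented defaults in 7.x:
# cluster.routing.allocation.disk.watermark.low: 85%
#   - no new shards allocated to the node
# cluster.routing.allocation.disk.watermark.high: 90%
#   - shards actively relocated away from the node
# cluster.routing.allocation.disk.watermark.flood_stage: 95%
#   - indices with a shard on the node are marked read-only (deletes still allowed)
```

Note that the low watermark only blocks *new* allocations; relocations that were already decided, plus the temporary double-occupancy of disk while a shard copies over, can still push a node past the watermarks.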
As the disk did become full: if I hadn't had the option of adding more space to the node, what else could I have done to force data off it? As far as I could tell the cluster was inoperable, so I'm not sure how I would have fixed the issue without first adding more space.
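One option, assuming the cluster still responds to REST requests on at least one node, is a cluster-level allocation filter that excludes the full node, which forces the balancer to move its shards elsewhere. `node-3` below is a placeholder for the affected node's name:

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node-3"
  }
}
```

The other fallback is deleting your oldest indices to free space immediately (deletes are still allowed even once the flood-stage block is in place). Remember to clear the exclusion (set it back to `null`) once the node has recovered, or shards will never move back.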
Thanks for any help.
Was your data path its own filesystem? If it's shared, particularly with the filesystem containing /var, Elasticsearch competes with other uses for disk space, and bad things result.
In that case, freeing space, say in /var/log, will let Elasticsearch resume.
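One wrinkle: once a disk passes the flood-stage watermark, the affected indices are marked read-only, so after freeing space you may also need to clear that block. On 7.4+ it should be released automatically once usage drops back below the high watermark, so this is mainly a belt-and-braces step:

```
PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```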
Hi rugenl, thanks for replying. No, they're not shared; all 3 nodes have a 5.5 TB (now expanded to 8 TB) dedicated XFS filesystem on a separate disk for Elasticsearch data.
If I recall correctly each node had about 2 TB of free space during the reallocation and I'd also stopped log ingestion whilst it was moving shards about, so I'm perplexed at how it managed to completely fill one node's disk.
Can you tell if it was one large index or a lot of indices? It could be an index with 1 primary and 0 replicas; that would live on only one node.
Hi, unfortunately not, since the cluster has now rebalanced itself successfully, so I don't know precisely which index's shards had moved to the problematic node. We don't have replicas because we don't have the disk space to keep replicas and still retain the data for as long as we need.
Our largest indices are the ones for Windows event logs; each of these is composed of 33 shards, which I assumed would be balanced evenly across the 3 nodes.
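For next time, the _cat APIs make it easy to see where the big shards actually live; `winlogbeat-*` below is just a guess at the index pattern:

```
# Disk use and shard count per node
GET _cat/allocation?v

# Largest shards first, with the node each one lives on
GET _cat/shards/winlogbeat-*?v&h=index,shard,prirep,store,node&s=store:desc
```

With 33 shards over 3 nodes you'd expect roughly 11 per node, but the balancer weighs shard *counts* rather than shard *sizes*, so a handful of unusually large shards can still pile up on one node.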
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.