Not releasing disk space after (failed?) shard allocations

We have a fairly small cluster of 3 nodes with ~40GB disks each and about 12 monthly indices with 1 replica. Data should be around 8GB of primaries, so 16GB with 1 replica. We run ES 6.2.

We recently had a runaway process indexing way too much data over the last 3 months, which we missed until it was too late. The cluster ran out of disk space, and went into read-only mode.

Since this was effectively a production outage and we were unable to add physical disk space on short notice we scrambled to fix the problem.

First we stopped the input of new data and ran a delete-by-query to remove the documents from the runaway process. We kept toggling off read-only mode only for ES to turn it straight back on, while ES kept trying (and failing) to move shards across the cluster due to the lack of disk space. Finally, nodes also crashed and restarted here and there... in short, it was messy.
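For future readers: in 6.x the read-only block that ES applies at the flood-stage watermark has to be cleared manually once space has been freed. A minimal sketch, assuming ES is reachable on localhost:

```shell
# Remove the read_only_allow_delete block from every index.
# ES will re-apply it if disk usage crosses the flood-stage watermark again.
curl -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_all/_settings' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```

Setting the value to null deletes the setting rather than setting it to false, which keeps the index settings clean.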

We disabled replicas to free up some emergency disk space, completed the delete-by-queries and ran an aggressive forcemerge (num_segments = 1) to bring the monthly indices back from ~15GB to 0.5GB.
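For reference, the replica and force-merge steps correspond to API calls along these lines (the index name is illustrative, assuming ES on localhost):

```shell
# Drop replicas on all indices to free disk space immediately:
curl -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_all/_settings' \
  -d '{"index.number_of_replicas": 0}'

# Force-merge one monthly index down to a single segment, which rewrites
# it without the documents that are merely flagged as deleted:
curl -XPOST 'http://localhost:9200/logs-2019-06/_forcemerge?max_num_segments=1'
```

Note that the force merge itself needs temporary disk space while the new segment is being written, which matters on an already-full disk.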

However, we did not see any major improvement in free disk space. At some point the whole cluster reported about 8GB of data without replicas, yet all three 40GB disks were still in the red and almost full.

My assumption is that amid the mess of moving shards, jumping into read-only mode and processes crashing, some shard relocations failed mid-flight and were never completed nor cleaned up. These partially-copied shards are now taking up disk space without being managed by ES.

What would be the recommended approach to clean up and recover in these situations? Other posts make it clear it is not desirable to start messing with the data folder manually on a running cluster. However, it seems like these file chunks on disk are not managed by ES anymore at all, so there is no delete index API or allocation API to call to clean this up through the application.

I haven't been in this situation myself so I can't verify if this theory is correct or not. But there is another reason why an index may still take up more disk space than expected after deleting a large number of documents: Deleting a document doesn't mean it gets removed from disk, it is just flagged as deleted and will only be removed once the segment it's stored in gets merged with another segment. This is an automatic process in Elasticsearch but may not happen for days or weeks, depending on a number of factors. I see that you've run a force merge but that may not have cleaned up the largest segments.

You can check the number of deleted documents in a given index by running something like this:

curl -XGET -s 'http://localhost:9200/my_index/_stats' | jq '._all.primaries.docs'
{
  "count": 78546918,
  "deleted": 2600457
}

In this example, my_index has 78.5 million documents and 2.6 million deleted, which means about 3% of the index is taken up by deleted documents. This isn't too bad; I've seen cases with up to 30%.
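The percentage is just the ratio of the two numbers in the stats output, e.g.:

```shell
# Share of the index occupied by deleted documents, using the
# counts from the stats example above:
awk 'BEGIN { printf "%.1f%%\n", 100 * 2600457 / 78546918 }'
# prints 3.3%
```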

To resolve your disk problem, if there is a large number of deleted documents in your indices, you could try reindexing each old index into a new one, which effectively removes all the deleted documents. But be aware that if you reindex inside the same cluster you will need extra disk space while the operation is ongoing; only once the reindexing is done can you delete the old index and thus free up the disk space.
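A reindex-and-swap for one monthly index could be sketched like this (index names are illustrative; doing it one index at a time keeps the temporary disk overhead bounded):

```shell
# Copy the old index into a fresh one; deleted documents are not copied:
curl -XPOST -H 'Content-Type: application/json' \
  'http://localhost:9200/_reindex' \
  -d '{"source": {"index": "logs-2019-06"}, "dest": {"index": "logs-2019-06-v2"}}'

# Only after verifying the new index, delete the old one to reclaim its disk:
curl -XDELETE 'http://localhost:9200/logs-2019-06'
```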

Yes, this is true. I've read a number of posts from people stuck with a red cluster after deleting files from the data folder, so that's a dangerous path. I think your best option is to reindex each index, or perhaps restore good versions of the indices if you have taken recent snapshots - see Snapshot and Restore.

For future reference, and for future readers, I wouldn't recommend this as a way to get out of the situation you found yourself in. Instead, I think it'd be best to snapshot any older indices and then completely delete them. Delete-by-query and force-merge both take time and temporarily increase the disk usage of an index, which may have made the problem worse, whereas deleting an entire index releases its disk space pretty much straight away. Once the problem is under control you can restore any deleted indices from snapshots and then clean up using delete-by-query at leisure.
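Assuming a snapshot repository is already registered, the snapshot-then-delete approach could look roughly like this (repository, snapshot and index names are all illustrative):

```shell
# Snapshot the old index and wait for the snapshot to complete:
curl -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/_snapshot/my_backup/emergency_1?wait_for_completion=true' \
  -d '{"indices": "logs-2019-01"}'

# Deleting the whole index releases its disk space almost immediately:
curl -XDELETE 'http://localhost:9200/logs-2019-01'

# Later, once the cluster is healthy again, restore it from the snapshot:
curl -XPOST 'http://localhost:9200/_snapshot/my_backup/emergency_1/_restore'
```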

I'm also a bit puzzled how you got all the way to the 95% flood_stage watermark without noticing the problem. Elasticsearch should have been emitting warnings about disk space all the way from 85%. Was that not the case? Above 85% you should have been seeing yellow cluster health after creating a new index too, since replicas are not allocated above the low watermark. Maybe with monthly indices you aren't creating indices frequently enough for this to have happened?
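You can check which watermark settings are in effect (the 6.x defaults are 85%, 90% and 95%) with something like:

```shell
# Show the effective disk watermark settings, including defaults:
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep watermark
```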

There are a couple of pertinent changes coming in Elasticsearch 7.4.0 that are worth mentioning:

  • #42559 adds a feature that automatically releases the write block when disk usage drops.
  • #46079 fixes a bug in which disk-based shard allocation would sometimes overshoot, particularly if you have configured too many concurrent recoveries.

Indeed, it is a very bad idea to do anything yourself to the insides of the data folder, even if the cluster is not running. But it's not the case that these files on disk are not managed by Elasticsearch: Elasticsearch deletes leftover shard data on disk when that shard is fully allocated and settled (i.e. reports green health and none of its copies are relocating). This includes any leftover remnants of recoveries that failed part-way through, even if that failure was due to the node crashing and restarting. It's not clear whether this contradicts what you observed. Can you identify any specific shard folders that have not been cleaned up as I've described? Although I don't recommend making any changes in the data folder, it might be useful to at least look inside it to help here. The path to a shard folder in 6.2 is $DATA_PATH/nodes/0/indices/$INDEX_UUID/$SHARD_ID if that helps.
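To cross-reference the folders on disk with the indices the cluster actually knows about, something like the following is safe because it only reads (the data path shown is just an example):

```shell
# List the indices the cluster knows about, with their UUIDs:
curl -s 'http://localhost:9200/_cat/indices?h=uuid,index'

# Measure each index folder on disk; a UUID present here but absent from
# the list above would be leftover data (but do not delete it by hand):
du -sh /var/lib/elasticsearch/nodes/0/indices/*
```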

Note that delete-by-query involves writing each deletion to the translog, and the translog is then retained for a while to help with recoveries. You can see the size of the translog in the index stats, and if that's the real disk space consumer then you can adjust its retention settings to drop it more quickly.
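In 6.x translog retention is a per-index setting; inspecting its size and tightening the retention might look like this (index name and values are illustrative):

```shell
# See how much disk the translog is holding for this index:
curl -s 'http://localhost:9200/my_index/_stats/translog?pretty'

# Shrink the retention window so old translog generations are freed sooner:
curl -XPUT -H 'Content-Type: application/json' \
  'http://localhost:9200/my_index/_settings' \
  -d '{"index.translog.retention.size": "64mb", "index.translog.retention.age": "1h"}'
```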

Force-merging down to a single segment will indeed clean up any other segments, as soon as they're no longer in use (e.g. held by an ongoing search or an open scroll context).

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.