Elasticsearch Curator - deleting indices by priorities (with Python API)

Hi!
I have the following scenario that I want to handle with curator python api:
I want to delete indices when my cluster storage reaches a fixed watermark (for example, 1 TB).
But I want to delete indices in a certain order. I have 3 groups of indices, each with a different deletion priority.

First, I want to delete indices from the first group, from oldest to newest (but nothing newer than 1 month), until the total storage of the cluster drops to the desired watermark.
If the watermark is still not reached, I want to continue and delete indices from the second group, from oldest to newest (but nothing newer than 3 months), until the total storage of the cluster drops to the desired watermark.
From the third group I never delete indices.

I was not able to figure out how to implement this with the Curator Python API.

I tried to filter the index list down to the first group's indices that are older than 1 month, which are the ones I want to delete first. Then I would have liked to call filter_by_space. But to do that, I had to provide the disk_space parameter, which is the desired space that this specific group of indices should occupy. This was inconvenient for me because I only know how much space the whole cluster should take, not this specific group of indices.
So my solution was:

Calculate the amount of storage I have to delete.
Calculate the total size of the group of candidate indices.
Subtract the amount to delete from the size of the group (clamping at zero) and pass the result to filter_by_space as disk_space.
This solution is not clean and requires many calculations (roughly sketched below).
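For reference, here is a minimal sketch of that workaround, assuming a Curator 5.x-style Python API; the group1- prefix, the local connection, and the 1 TB target are placeholders I made up, and the filter arguments are illustrative rather than a drop-in recipe:

```python
# Rough sketch of the workaround above, using the Curator Python API
# (5.x-era interface). GROUP1_PREFIX and the 1 TB target are placeholders.
import curator
from elasticsearch import Elasticsearch

TARGET_CLUSTER_GB = 1024                 # assumed watermark: 1 TB
GROUP1_PREFIX = "group1-"                # hypothetical naming convention

client = Elasticsearch()                 # assumes a local cluster on the default port

# 1. How much storage needs to go, cluster-wide.
cluster_bytes = client.cluster.stats()["indices"]["store"]["size_in_bytes"]
to_delete_gb = max(cluster_bytes / 2**30 - TARGET_CLUSTER_GB, 0)

if to_delete_gb > 0:
    # 2. Size of the delete candidates: group 1, older than 1 month.
    ilo = curator.IndexList(client)
    ilo.filter_by_regex(kind="prefix", value=GROUP1_PREFIX)
    ilo.filter_by_age(source="creation_date", direction="older",
                      unit="days", unit_count=30)
    group_gb = sum(ilo.index_info[idx]["size_in_bytes"]
                   for idx in ilo.indices) / 2**30

    # 3. filter_by_space expects the space the group may keep, not the
    #    space to free, hence the subtraction (clamped at zero).
    keep_gb = max(group_gb - to_delete_gb, 0)
    ilo.filter_by_space(disk_space=keep_gb, use_age=True,
                        source="creation_date")
    if ilo.indices:
        curator.DeleteIndices(ilo).do_action()
```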

I would like to suggest that filter_by_space either check how much space the whole cluster takes (rather than just the specific group of indices), or accept a space_to_clean parameter, meaning how much storage to delete - but maybe I am missing something.

Do you have any suggestions on how to handle this scenario better?

Thanks.

I want to delete indices when my cluster storage reaches a fixed watermark (for example, 1 TB).

While I can sympathize with your desire for this, it is not now, nor has it ever been, a recommended approach to data retention in Elasticsearch. A search through the issues in Curator's source code repository will reveal that I added filter_by_space (or the older disk_space variant before v4) with frequently expressed reservations (which are also included as caveats in the Curator documentation). Why? Because shard allocation can result in an unequal distribution of data, which means that while this approach might be just fine for some users, it would be wholly inadequate, if not outright dangerous, for others. This becomes even more of a problem when dealing with differing indices containing different data in different amounts.

The second reason this is not recommended is that Elasticsearch is unable to report the amount of space consumed by indices in closed state, which could render usage report API calls completely inaccurate. This could result in closed indices being deleted erroneously, or worse, open indices getting deleted when that behavior was not desired.
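To make that blind spot concrete, here is a small sketch (connection details assumed) that at least surfaces which closed indices would be missing from any size arithmetic:

```python
# Closed indices appear in _cat/indices with status "close" and no size,
# so any disk-usage calculation silently ignores them.
from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumes a local cluster on the default port

closed = [row["index"]
          for row in client.cat.indices(format="json")
          if row.get("status") == "close"]
if closed:
    print("Size totals will not account for these closed indices:", closed)
```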

The third reason this approach is not recommended is that a high shard count per node will affect a cluster's performance, regardless of how much disk space is used (or not used). Generally, on a node with a 30G heap (not system RAM size), you should not exceed 600 shards per node. This value scales (not necessarily linearly) with your heap size. A smaller heap means a smaller number of shards per node before things start to go south (indexing speed decreases, search performance degrades, garbage collections increase in frequency and duration, memory pressure increases). Setting an arbitrary watermark ignores these constraints. Users who haven't learned them or been affected by them believe they should be able to fit as many shards on a node as there is disk space to accommodate, which can lead to memory pressure, followed by a cascade of failures.
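If you want to see where your cluster stands against that guideline, a rough sketch (the connection and the 30G per-node heap figure are assumptions to adjust, not facts about your cluster):

```python
# Rough shards-per-node check against the ~600-shards-per-30G-heap guideline.
from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumes a local cluster on the default port

stats = client.cluster.stats()
total_shards = stats["indices"]["shards"]["total"]
data_nodes = max(stats["nodes"]["count"]["data"], 1)

HEAP_GB_PER_NODE = 30                     # adjust to your actual heap size
budget = HEAP_GB_PER_NODE * 20            # ~600 shards for a 30G heap

print(f"{total_shards / data_nodes:.0f} shards per data node "
      f"(rough budget: {budget})")
```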

For these reasons, deleting indices exceeding a disk space watermark is not likely to be added to Curator until it becomes a practice marked as either acceptable or recommended by the core Elasticsearch developers. While you might be able to successfully argue that your particular use case may be a safe one in which to use this approach, providing it as an out-of-the-box feature in Curator makes it look like it is not only acceptable, but completely normal. I just can't do that.

My personal recommendation would be to stick with the hard limits you suggested: delete anything exceeding the last 30 days in group 1 and the last 90 days in group 2, while ignoring group 3. I know that's not likely to be viewed as an improvement on how you are doing things, but that is my considered opinion, for all of the reasons given above.
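If it helps, a minimal sketch of what that looks like with the Curator Python API, assuming your groups can be matched by hypothetical name prefixes like group1- and group2-:

```python
# Hedged sketch of plain age-based retention: group 1 keeps 30 days,
# group 2 keeps 90 days, group 3 is never selected at all.
import curator
from elasticsearch import Elasticsearch

client = Elasticsearch()  # assumes a local cluster on the default port

RETENTION_DAYS = {
    "group1-": 30,   # prefixes are hypothetical placeholders
    "group2-": 90,
    # no entry for group 3, so it is never touched
}

for prefix, days in RETENTION_DAYS.items():
    ilo = curator.IndexList(client)
    ilo.filter_by_regex(kind="prefix", value=prefix)
    ilo.filter_by_age(source="creation_date", direction="older",
                      unit="days", unit_count=days)
    if ilo.indices:
        curator.DeleteIndices(ilo).do_action()
```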

