Curator advancing ILM phase due to disk usage

This is a feature request for Curator.

I know that Elastic's official stance is that clusters should be sized for retention-time requirements, and I know the answer to exhausting disk space is to enable automatic scaling in Elastic Cloud. Because of this, Elastic appears resistant to adding the ability to trigger an ILM phase transition based on disk usage. That's why I believe there's room for this feature in Curator.

Consider that Curator currently has the ability to delete the oldest indices when disk usage hits a specified level. That is the equivalent of ILM's delete phase, except that ILM cannot trigger the transition by disk usage, only by index age. By using disk usage, one would be able to extend retention times opportunistically by fully utilizing the available disk space, rather than leaving space on the table when ingestion is lower than expected. It also acts as a fallback to avoid running out of space and blocking writes entirely when ingestion is higher than expected and auto-scale is not enabled. In most cases, I believe continuing to ingest current data takes precedence over retaining the oldest indices.

In a multi-tier environment, the warm nodes, for example, allow speedier queries, but it's often not critical to maintain exactly 7 days of warm, and dropping to 6 when space is exhausted is fine. Similarly, if warm nodes are slightly oversized, retention could be extended to 8 or 9 days rather than forcing a move to cold at exactly 7 days, as long as it doesn't risk disk exhaustion. Plus, if cold or frozen is much larger, a day or two of high ingestion rate may not require more storage at that tier, but could exhaust warm node space if nothing is done. Such a spike in ingestion would require either a temporary growth in warm nodes (auto-scale) or an earlier transition from warm to cold for a few days (this feature).

If the ability to transition phases based on disk usage will never be added to ILM, then Curator could fill this role by monitoring a tier's total disk usage and triggering an early transition to the next tier when a threshold is passed.

I could see Curator following this process:

  1. Check disk usage on all nodes holding shards of matching indices in a matching ILM phase; do nothing if none exceeds the specified threshold.
  2. Sort matching indices by age. If any shards of the oldest index are already relocating, do nothing.
  3. Use the "POST _ilm/move/" API to move the oldest index from its current ILM phase/action/name to the next phase as specified.
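The selection logic in those three steps could be sketched roughly like this. This is a minimal illustration: `TierIndex` and `pick_index_to_move` are made-up names, and a real implementation would pull disk stats and index ages from the cluster APIs before issuing the actual `_ilm/move` request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TierIndex:
    name: str
    age_days: float
    relocating: bool  # True if any shard of this index is currently relocating


def pick_index_to_move(node_disk_pct: dict[str, float],
                       indices: list[TierIndex],
                       threshold_pct: float) -> Optional[str]:
    """Return the name of the index to move to the next ILM phase,
    or None if no move is needed or safe this cycle.

    Mirrors the proposed steps:
      1. Do nothing unless some node holding matching shards exceeds
         the usage threshold.
      2. Sort matching indices oldest-first; if the oldest index is
         already relocating, do nothing.
      3. The caller then issues POST _ilm/move/ for the returned index.
    """
    # Step 1: threshold check across the tier's nodes
    if not any(pct > threshold_pct for pct in node_disk_pct.values()):
        return None
    if not indices:
        return None
    # Step 2: find the oldest matching index
    oldest = max(indices, key=lambda i: i.age_days)
    if oldest.relocating:
        return None
    # Step 3: caller performs the actual _ilm/move request
    return oldest.name
```

For example, with one warm node at 91% usage and a threshold of 85%, the oldest non-relocating index would be returned; below the threshold, nothing happens.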

Safeguards:

  • It's probably wise to set a minimum index age: for example, if ILM is set to move at 7 days, don't let Curator move an index younger than 4 days.
  • Usage threshold should probably be specified separately rather than using an existing watermark value.
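Combining the two safeguards is straightforward. In this sketch, the function name is made up and the 4-day/85% figures are illustrative values from the examples above; none of this is an existing Curator option.

```python
def may_move_early(index_age_days: float,
                   min_age_days: float,
                   disk_usage_pct: float,
                   move_threshold_pct: float) -> bool:
    """An index is eligible for an early phase move only if it is old
    enough AND the tier's disk usage has crossed the dedicated move
    threshold (kept separate from the cluster watermark settings)."""
    return (index_age_days >= min_age_days
            and disk_usage_pct > move_threshold_pct)


# With ILM set to move at 7 days and a 4-day floor:
# may_move_early(5.0, 4.0, 92.0, 85.0) -> True  (old enough, disk full)
# may_move_early(3.0, 4.0, 92.0, 85.0) -> False (too young, never moved)
# may_move_early(8.0, 4.0, 80.0, 85.0) -> False (disk fine, ILM handles it)
```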

One could use this ability in a couple of ways:

  • Failsafe: If targeting a warm retention of 7 days, size the nodes so they normally hold 7 days with a suitable margin, set ILM to 7 days, and Curator only triggers an early move when an ingest spike actually happens.
  • Optimal use: If targeting a warm retention of 7 days, size the nodes so they normally hold 7 days with a suitable margin and set ILM to 8 or 9 days, so Curator triggers most days and the tier's storage is fully utilized.

Because Elastic Cloud's granularity is powers of 2, if, for example, three 8 GB warm nodes hold 6 days of data but I want 7, my only options are to lower my expectations and live with 6 days, or double the node size to 16 GB and have 12 days of space available, whether or not it's used. Plus, if I do go with 6 (or increase to 12), I'm now living on the edge, where any ingest spike could cause me to hit watermarks or exhaust disk space.
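Spelling out the sizing arithmetic in that example (assuming, as the example does, that retention scales linearly with node size, i.e. three 8 GB nodes hold 6 days of data):

```python
# Three 8 GB warm nodes hold 6 days of data, so each node holds
# 2 days' worth, and doubling node size doubles retention capacity
# (Elastic Cloud only offers power-of-2 node sizes).
NODES = 3
DAYS_PER_8GB_NODE = 2.0  # 6 days spread across 3 nodes


def retention_days(node_size_gb: float) -> float:
    """Days of retention the warm tier can hold at a given node size."""
    return NODES * DAYS_PER_8GB_NODE * (node_size_gb / 8.0)


print(retention_days(8))   # 6.0  -> short of the 7-day target
print(retention_days(16))  # 12.0 -> 5 days of paid-for headroom
```

There is no node size between the two that yields exactly 7 days, which is the gap this feature would let Curator absorb.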

This feature would allow me to fully utilize the nodes I have without enabling the fiscally-scary option of auto-scaling. Were I to experience an ingestion spike, Curator could start transitions a little sooner and certain queries might take a few milliseconds longer for a few days, but no auto-scaling would be needed and ingestion would not stop due to disk exhaustion.

What do you think? Do you see the utility in such a feature? Or is there an effort underway to add this ability to ILM soon?

The idea has enough merit that I took the liberty of adding an issue to GitHub:


Thank you. I was hoping to get your attention and validation of it. Glad to see the GitHub issue opened.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.