Ingest node disk usage keeps rising

I've created a new Elastic Cloud deployment (7.14.x). The hot tier contains 1 node with a 2 GB disk, the warm tier 1 node with a 320 GB disk. I won't do any document/index updating; a human will occasionally check logs via Kibana.

I've set up ILM so all indices are moved to the warm tier immediately and then deleted after 1 day: move to warm after 0 days, delete after 1 day, no snapshot. I confirmed all indices have the right ILM policy applied.

I'm sending logs at a rate of about 60 MB per hour, about 20 thousand documents per hour, spread over 3 indices (log-yyyy-MM, err-yyyy-MM, audit-yyyy-MM). About 24 hours have passed and the hot tier disk is showing around 30% (~700 MB) used and still steadily rising, while the warm tier disk is at 0% (~1.2 GB). Hot tier instance CPU is steady below 20%.

The hot tier disk usage is surprising me, especially the constant and steady increase. I expected it to be much lower, since I've set up ILM to move the data immediately.

Am I doing something wrong? Am I wrong in expecting I can minimize hot tier resource usage? How can I know when the hot tier disk usage will stabilize? Will it ever stabilize? What is the disk being spent on, and why is it steadily increasing if the data input rate is stable?

This is less than 5% of the traffic I'll be sending to this cluster. I plan to have over 100 different indices coming in from around 50 Beats applications and about 200 in-house developed log senders, and I'm trying to get an estimate for appropriate node sizes.

Hi @Tadija Welcome to the community and thanks for trying Elastic Cloud.

Thanks for taking the time to explain your case.

Can you please share your ILM Policy?

This is a bit interesting... ILM does not move data "immediately"; it runs in the background periodically. This is often a bit of a misconception. Here is a little more from a previous post.
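For context, ILM evaluates its policies on a timer rather than at ingest time; the schedule is controlled by the cluster setting indices.lifecycle.poll_interval, which defaults to 10 minutes. A minimal sketch of shortening it while experimenting (the 1m value below is only an illustration, not a production recommendation):

PUT _cluster/settings
{
  "persistent": {
    // how often ILM checks whether any index needs to advance a step
    "indices.lifecycle.poll_interval": "1m"
  }
}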

Hi Stephen, thank you for taking the time to answer. Here is the ILM policy in question:

PUT _ilm/policy/1-day-retention
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "0s",
        "actions": {
          "set_priority": {
            "priority": 50
          }
        }
      },
      "delete": {
        "min_age": "1d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

Here is one index where the applied lifecycle policy is visible:

"settings": {
    "index": {
      "lifecycle": {
        "name": "1-day-retention"
      },
      "routing": {
        "allocation": {
          "include": {
            "_tier_preference": "data_warm,data_hot"
          }
        }
      },
      "number_of_shards": "1",
      "provided_name": "app-log-2021.09",
      "creation_date": "1631731338880",
      "priority": "50",
      "number_of_replicas": "1",
      "uuid": "75TaKBoAQlib0FM-m6xACg",
      "version": {
        "created": "7140199"
      }
    }
  }
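To see exactly which phase, action, and step ILM currently has an index in (and to spot indices stuck on a step), the explain API can be used; app-log-2021.09 is the index shown above:

GET app-log-2021.09/_ilm/explain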

On Elastic Cloud, nodes get resources allocated in proportion to their size, so the amount of CPU and disk I/O available is likely proportional to the amount of RAM your node has. As indexing is both CPU and disk I/O intensive, there is a limit to how much data your small hot node can ingest per day even if you move data off to warm nodes frequently, and given the size of your hot node that limit is likely to be quite low. This does not sound like a balanced cluster to me.

I would generally recommend holding about 3-5 days' worth of data on the hot nodes in Elastic Cloud. If we assume 3 days for simplicity, this means that your hot node should be able to create around 600 MB of indices every day, which is roughly what you are currently achieving. This is very low compared to the size of your warm node, so unless you require an extremely long retention period, your hot node sounds quite undersized.

If you have created 700 MB of indices in 24 hours and this only represents 5% of the load you plan to put on the cluster, it sounds like you will be generating around 14 GB of indexed data per day. Given the rule of thumb I described earlier, and assuming a bit over 3 days of data being held on the hot node, you would need to increase the size of the node so it can hold around 50 GB of data.
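Spelling out that estimate (700 MB over 24 hours at 5% of the planned load, with a bit over 3 days held on the hot tier):

0.7 GB/day / 0.05     = 14 GB/day at full load
14 GB/day * 3.5 days  = 49 GB, i.e. roughly 50 GB of hot tier disk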

That sounds very inefficient and suboptimal. To efficiently utilize the resources in your cluster it is also important to avoid creating a large number of small indices and shards, which, as described, unfortunately seems to be what you are planning to do. I would recommend consulting the docs for best practices. For a cluster this size, try to make sure that shards are each at least around a few hundred MB in size, and remember that there is a limit to the number of shards a node can handle.
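A quick way to spot a build-up of tiny shards is the _cat/shards API, sorted by store size:

GET _cat/shards?v=true&h=index,shard,prirep,state,store&s=store:desc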

Christian, thank you for your answer; I appreciate the rough estimates. It figures that the smallest node can't handle all that traffic, even though I hoped it would.

My plan is to redirect logs to this new cluster in stages and to scale the cluster as I go, so it made sense to start from the smallest configuration Elastic Cloud offers and add more as needed. Currently I don't feel confident or comfortable calculating the needed size up front and sending all traffic to the cluster trusting that my calculations are correct.

So is it correct that, presuming the traffic flow is consistent and stable, once I hit the sweet spot the hot tier disk usage will stop growing and oscillate around some percentage, because ILM will kick in and transfer data to the warm tier node?

So something like this: after stable growth it will stabilize and flatline? Or will there be sudden drops once the 3-5 day period passes? What is a good timeframe for deciding the cluster is stable enough for its current load?

[image: sketch of disk usage rising steadily, then flattening out]

If that is the case, I plan to slowly add more traffic to the cluster and increase the hot tier node's power when disk usage goes above some threshold, let's say 75%. Does that make sense?

Are there any other metrics I should be looking at apart from disk utilization? Is there a rule of thumb for when adding some message queueing should be considered?

Again, thank you guys for your time and advice. I hope I am not straying off-topic too much.

Hi @Tadija

tl;dr ...

Yes, when your ingest rate stabilizes

And you have a valid ILM policy

Yes, your storage consumption on hot and warm will stabilize as well.

I've run a number of clusters that I don't touch now, anywhere from low tens of GB per day to nearly a TB per day of ingest, and both are stable with respect to storage on all tiers using ILM.

We have many customers with fixed cluster sizes and a fixed ingest rate that use ILM to manage the index and storage life cycle.

There will be some variability as indices fill up and then migrate from hot to warm, but it will stay within a range if all components are operating correctly.
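One way to watch that range is the _cat/allocation API, which shows per-node shard counts and disk usage, so you can see the hot and warm levels oscillate:

GET _cat/allocation?v=true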

Just in case someone stumbles upon the same issue, this is what solved it for me:

When I went to "Index Management" all my indices had health status green, but if I looked at their details I could see a message looking something like:

[log-2021.09] lifecycle action [migrate] waiting for [1] shards to be moved to the [data_warm] tier (tier migration preference configuration is [data_warm,data_hot])

and the Current action was migrate.

I found a similar issue here: https://github.com/elastic/elasticsearch/issues/69347 where the solution was a workaround for a GUI bug: they set the replica count to zero via the API.

Because my cluster currently has no nodes to hold replicas, I edited the ILM policy through Dev Tools to set "number_of_replicas": 0 in the warm phase, like this:

PUT _ilm/policy/1-day-retention
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "0m",
        "actions": {
          "set_priority": {
            "priority": 50
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "1d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

After that, the hot node sent its data to the warm node every 10 minutes (the default ILM poll interval), and my hot node disk usage is now hovering around 3%, which is a great starting point for adding more load to the cluster.
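For indices that already existed and were stuck waiting in the migrate step, the workaround from the linked GitHub issue can also be applied directly through the index settings API (log-2021.09 being the stuck index from the message above):

PUT log-2021.09/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}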
