Configuring an ILM hot-cold-delete policy on an ES cluster

I'm planning to apply ILM policies to my ES cluster, which contains historical data that must be preserved for legal purposes, with a retention period of 7 years (85 months).

This cluster used to have day-wise indices. To improve search performance, I re-indexed the day-wise indices into monthly indices.

The customer is only interested in the last 18 months of data. The rest of the data is kept for legal purposes and is queried only rarely.

Hence, I'm considering a hot-warm-cold-delete architecture here.

  1. The delete phase is straightforward: delete monthly indices older than 85 months.

  2. As per ILM, the current month's index should live on a hot node. Since I index monthly, this is just a single index of around 500 GB primary data with 12 shards and 1 replica. Does it make sense to have a single hot node holding just this one index, or 2 hot nodes holding the index and its replica?

I was thinking of going with a hot-cold-delete architecture (no warm), i.e. keep the last 18 months of data on hot nodes and everything older than 18 months (but within the retention period) on cold nodes.
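For reference, here's a minimal sketch of what such a policy could look like on 6.8, assuming the transitions are driven by index creation date (no rollover); the policy name, the `box_type` attribute value and the day counts (~18 months and ~85 months) are placeholders:

```
PUT _ilm/policy/monthly-retention
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {}
      },
      "cold": {
        "min_age": "548d",
        "actions": {
          "allocate": {
            "require": { "box_type": "cold" },
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "2585d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

The `allocate` action assumes the data nodes carry a matching `node.attr.box_type` setting, and dropping replicas in the cold phase would mirror what I'm already doing for data older than 18 months.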

My questions:

  1. Does this make sense?

  2. Or would you suggest having just the current month's index and its replica on hot nodes, and the data from the previous month back to 18 months on warm nodes?

  3. If I configure ILM, I have to specify `"min_age": "31d"` for the current monthly index to move from hot to warm/cold. Is that correct? That would mean an index for the month of April is moved to a warm/cold node only after 31 days rather than after 30 days (see the explain sketch after this list).

  4. For my use case, I don't need to configure rollover.
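As a side note on question 3: since I'm not using rollover, my understanding is that `min_age` is measured from index creation, and the age ILM sees for an index can be checked with the explain API (the index name below is just an example):

```
GET logs-2021.01/_ilm/explain
```

The response should show the index's current phase, action and age, which would make it easy to verify when the move is actually going to happen.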

The ELK Stack version is 6.8. All past monthly indices are force-merged and use best_compression. Even the current month's index uses best_compression; the indexing penalty on the current index is acceptable since the data doesn't need to be queried immediately.
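For context, the codec is a static index setting applied at creation time, e.g. via a template roughly like the following (the template name, pattern and shard counts here are illustrative):

```
PUT _template/monthly-logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.codec": "best_compression",
    "index.number_of_shards": 12,
    "index.number_of_replicas": 1
  }
}
```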

Thanks

@Christian_Dahlqvist - any thoughts? I'd really appreciate some pointers here. I went through the excellently written blog.

How many monthly indices do you have?

What is the size of indices/shards for the monthly indices?

How quickly would you need access to data older than 18 months? If it is for regulatory purposes, would restoring data from a snapshot be acceptable? Would you know which indices that would need to be restored?

If you are planning to store large data volumes on cold nodes I would recommend upgrading to the latest 7.x version as it could significantly reduce heap usage.

Hi Christian,

Thanks for the prompt reply and apologies for the delayed response. Got pulled into other work and this took a backseat.

Currently, I have 85 monthly indices running from Jan-2014 through Jan-2021 (the current month's index). Since the retention period is 85 months, on 01-Feb-2021 the Jan-2014 monthly index would get purged.

The monthly indices average ~300 GB in size with 7 shards. I can change the shard count to 5, or any other value you'd suggest.

Yes, restoring data from a snapshot would be acceptable, but it's better if the data is directly available rather than having to trigger a restore. We have a restore script, but no, we do NOT know which of those indices will contain the data; that's the problem. We would end up restoring all the indices older than 18 months, which would defeat the purpose in the first place.

Thanks for this. Also, my mappings are all optimised: there are just 4 mappings, the fields are all keywords, and searches happen on only one field. The frozen tier would be perfect for my use case, since I could off-load all data older than 18 months to it and search it via searchable snapshots, but at the moment the frozen tier is not yet GA (though I've been told it will be GA soon).
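For illustration, the mappings are essentially of this shape (the field names here are made up, and only one of them is actually searched):

```
PUT logs-2021.01/_mapping/_doc
{
  "properties": {
    "transaction_id": { "type": "keyword" },
    "account_ref":    { "type": "keyword" },
    "region":         { "type": "keyword" },
    "status":         { "type": "keyword" }
  }
}
```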

I also found that I cannot go with hot-cold because the forcemerge action is only available in the WARM phase. Thus, I'll need to go with HOT-WARM-COLD. Please correct me if my understanding is wrong here.
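If that's the case, the warm phase would exist only to run forcemerge. A sketch of what that fragment could look like inside the policy's "phases" block (the day count and segment count are placeholders, and without an allocate action the shards would stay on their current nodes):

```
"warm": {
  "min_age": "31d",
  "actions": {
    "forcemerge": {
      "max_num_segments": 1
    }
  }
}
```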

Your blog post mentions that

For nodes tuned for long-term data storage, it often makes sense to let them work as dedicated data nodes and minimize any additional work they need to perform. Directing all queries either to hot nodes or dedicated coordinating only nodes can help achieve this.

Can you please shed some light on how I can achieve this? i.e. how can I route queries to the hot/warm nodes only?

Here's some info on my current configuration: the 6.8.6 cluster has 3 dedicated master nodes and 7 data nodes. [I can reduce the data nodes from 7 to 4 because `best_compression` reduced disk usage by nearly **40%**.] The master nodes are minimal (2 cores and 7 GB RAM), while the data nodes are powerful Azure VMs with 16 cores and 55 GB RAM.

The dataset is around ~24 TB. Of that, the last 18 months of data including replicas is ~12 TB. I've set replicas = 0 for data older than 18 months [all of that data is snapshotted and read-only, so it can be restored in case of data loss or a node going down]; the data older than 18 months comes to ~12 TB without replicas.
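If I do end up splitting the data nodes into hot and cold tiers, my understanding is that each node needs an attribute the policy's allocate action can target, roughly like this in elasticsearch.yml (the attribute name `box_type` is just a convention):

```
# on the data nodes meant for recent data
node.attr.box_type: hot

# on the data nodes meant for older data
node.attr.box_type: cold
```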

I'm beginning to wonder if I'm adding too much complexity with hot-warm-cold here. The aim is to reduce costs, but I think a lot of that has already been achieved with best_compression.

That sounds fine the way it is. I see no need to change that.

As far as I know, going through different phases does not necessarily mean you need to relocate shards to new nodes at every transition. I suspect you should be able to keep the data on the hot nodes for both the HOT and WARM phases and only move it when it turns COLD. Not sure if this is something that might have changed over time.

Only give hot nodes to the clients querying Elasticsearch.

Thanks a ton Christian for the super prompt replies.

Excellent. So the clients should connect to the pool of hot nodes only. Got it. What about Kibana? Should the elasticsearch.hosts parameter in kibana.yml also point only to the hot nodes, or to all nodes?
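For reference, that parameter is just a list of HTTP endpoints, so if the answer is hot nodes only, it would look something like this in kibana.yml (hostnames are made up):

```
elasticsearch.hosts:
  - "http://es-hot-1:9200"
  - "http://es-hot-2:9200"
```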

This is great to hear. So will the forcemerge action still kick in once the respective number of days has elapsed, even though there's no warm phase?
