Retention Policy for a single growing index

(Shrikant) #1

I have a single index that is growing daily. I can't use Curator for data retention, since Curator can't delete data from within an index.

How should I proceed?

Any help would be really appreciated.


(Aaron Mildenstein) #2

If you can, switch to using multiple indices behind an alias (hourly, daily, weekly...whatever works for your use case). This will allow you to reduce the need for delete_by_query, which is your only other recourse.
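For illustration, a rough sketch of what that could look like in the Dev Tools console (the index names, alias name, and timestamp field here are all hypothetical):

```
# Hypothetical daily indices, all queryable through one alias
PUT /logs-2018.01.01
PUT /logs-2018.01.02

POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-*", "alias": "all-logs" } }
  ]
}

# Retention becomes a cheap index-level delete (which Curator can manage):
DELETE /logs-2018.01.01

# The alternative on a single index is the much more expensive delete_by_query:
POST /my-big-index/_delete_by_query
{
  "query": { "range": { "@timestamp": { "lt": "now-3y" } } }
}
```

Deleting an entire index is a near-instant metadata operation, while `delete_by_query` has to find and mark every matching document and then wait for merges to reclaim the space.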

(Shrikant) #3

Thank you for the reply :slight_smile:

I am using the Elastic Stack for log storage. If we go with indices on a daily basis, that will not be possible, because I need the logs for a minimum of three years.

Please guide me on how to proceed.


(Aaron Mildenstein) #4

Why are you asking about the index growing daily if you need to keep data for 3 years? Do you only need to keep some of the data for 3 years? If so, then you need to put the data you want to keep for 3 years in one kind of index, and the data that can be deleted in another kind of index.

You can make a retention policy that still uses daily, weekly, or monthly indices and keep what you want in separate indices for 3 years. You can refer to them by alias, allowing them to appear to be a single index, but have them be separate. Again: aliases make it possible for multiple, disparate indices to appear as a single index, if that is what you require for queries. Your data does not all have to reside in a single, enormous index.
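As a sketch of that split (the index-family names are hypothetical), two groups of indices with different retention rules can still be searched as one:

```
# Hypothetical: a long-retention family and a short-retention family,
# both visible through a single alias for querying
POST /_aliases
{
  "actions": [
    { "add": { "index": "logs-keep3y-*",    "alias": "all-logs" } },
    { "add": { "index": "logs-shortterm-*", "alias": "all-logs" } }
  ]
}

# A search against the alias spans both families as if they were one index:
GET /all-logs/_search
{
  "query": { "match_all": {} }
}
```

Retention then operates on whole indices within each family, at whatever schedule each family needs.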

Additionally, you can't have a single enormous index do what you appear to be asking for anyway. A single shard should not exceed 50GB (official recommendation from Elastic), so you'd best plan on rotating your indices into new ones often enough to stay under that number.
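One way to automate that rotation is the Rollover API; here is a minimal sketch (names and threshold are hypothetical, and the `max_size` rollover condition requires Elasticsearch 6.1+):

```
# Bootstrap the first index with a write alias
PUT /logs-000001
{
  "aliases": { "logs-write": {} }
}

# Call this periodically (e.g. from cron); a new index is created only
# when the current one crosses the size threshold
POST /logs-write/_rollover
{
  "conditions": { "max_size": "50gb" }
}
```

Writers always index into `logs-write`, so rollover is invisible to them; only the concrete index behind the alias changes.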

(Shrikant) #5

Hello Aaron,
Thank you for the reply.

I want to use Elasticsearch for storage purposes only. How should I proceed? I will be storing data from around 140 servers (Windows and Linux), and I want the data for the last three years to remain queryable (the index should hold the last three years of data).

Please correct me if I am wrong.


(Aaron Mildenstein) #6

I'm really confused. First you want to know how to handle retention for a single large index because Curator can't delete data from an index. Now you explain that you are trying to keep data for 3 years. What you really have is a much broader question concerning index sizing, shard management, and cluster sizing if you want to keep data for 3 years. Instead of asking for the retention policy for a single growing index, you should be asking:

  • "How do I plan/architect a cluster to hold 3 years' worth of log data?"
  • "What are the things that might affect cluster performance?"
  • "Is this a good use case for Elasticsearch?"
  • "What is the maximum allowable delay before a query returns a value?"
    • "Can I close indices and only re-open them if a query on very old data needs to happen?"
  • "Where can I find training and/or other educational materials to help me learn how best to do these things?"

(Shrikant) #7

Hello Aaron,

Thank you for correcting me.

Yes, I have many doubts, as I am at an early stage of my learning. Please guide me:
> "How do I plan/architect a cluster to hold 3 years' worth of log data?"
(I will be getting around 1,500 logs per 15 minutes per server, and I have 140 servers)
> "What are the things that might affect cluster performance?"

Please guide.
Any help would be really appreciated.


(Aaron Mildenstein) #8

By my calculations, that means approximately 20,160,000 log lines per day. That's only about 233.33 per second, which is a relatively slow rate.

You should try loading some of this data into a single-node test cluster (like your laptop or local workstation) and see how much of your data an index with a single shard and no replicas can hold. Apply a good index mapping for this data before beginning. See roughly how many docs it takes for that single shard to reach approximately 50GB in size. This is your single-shard rollover point. You should be indexing into single-shard indices at that rate, and only rolling over (Rollover API) when you hit 50GB shard sizes.

This would theoretically allow you to put up to 30TB of data on a single node with a 31GB heap. I'm guessing that's a lot of days' worth of data. Simply add more nodes until you've reached your 3-year mark in terms of capacity. Start expiring old indices at that point.
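A quick back-of-the-envelope check of those numbers (a minimal sketch; the inputs are just the figures quoted earlier in this thread):

```python
# Throughput figures from the thread: 1,500 log lines per server per
# 15-minute window, across 140 servers.
servers = 140
lines_per_window = 1_500
windows_per_day = (24 * 60) // 15  # 96 fifteen-minute windows per day

lines_per_day = servers * lines_per_window * windows_per_day
lines_per_second = lines_per_day / 86_400  # 86,400 seconds in a day

print(lines_per_day)               # 20160000 lines per day
print(round(lines_per_second, 2))  # 233.33 lines per second
```

Multiply `lines_per_day` by your average document size (plus indexing overhead) to estimate daily storage, then by ~1,095 days to size the 3-year cluster.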

(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.