Can the Rollover API / ILM be used to keep only x days of data in an index at any point in time?

I have read a few threads regarding this question. Most of them suggest using delete-by-query, as the Rollover API seems to delete indices (or move them to another phase) and create new ones based on a given condition.

I could not find a way in which Rollover / ILM can work like delete-by-query, which does the following:

POST my-index/_delete_by_query?conflicts=proceed
{
 "query": {
  "range": {
   "@timestamp": {
    "lt": "now-90d/d"
   }
  }
 }
}

ILM works by deleting complete indices. It cannot be used to delete data from within an index. If you need to delete some data from within an index, delete-by-query is what you need. You will need to trigger and run these requests yourself, though. There is no way to do so from within Elasticsearch.

But delete-by-query usually isn't regarded as an efficient solution.
@theuntergeek mentioned this in the post "What is the definitive way of only retaining 7 days of logs".

No, that is true. Using delete-by-query is a lot less efficient compared to deleting time-based indices.

He mentioned: "I highly recommend looking into the Rollover API for a way to simplify this for you. Then you can make your "non-time-series" index into a time-series index for all intents and purposes."

I didn't quite understand how Rollover can help in this use case.

Do you have immutable data or do you perform updates?

It's a simple log index, so it's append-only (no updates to documents once they are indexed).

In that case I would recommend you switch to using rollover and ILM, ideally through the use of data streams.
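As a rough illustration of what such a policy could look like, here is a minimal sketch that builds an ILM policy body as a Python dict, which can be printed as JSON and sent with `PUT _ilm/policy/<name>`. The function name and the `1d` / `30d` values are assumptions chosen for the example, not values from this thread.

```python
import json


def build_ilm_policy(rollover_max_age: str, delete_min_age: str) -> dict:
    """Build an ILM policy body: roll over in the hot phase, delete later.

    Both arguments are Elasticsearch time values such as "1d" or "30d".
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Create a new write index once the current one is this old.
                        "rollover": {"max_age": rollover_max_age}
                    }
                },
                "delete": {
                    # For rolled-over indices this age is counted from rollover time.
                    "min_age": delete_min_age,
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Hypothetical values: daily rollover, delete 30 days after rollover.
policy = build_ilm_policy("1d", "30d")
print(json.dumps(policy, indent=2))
```

A shorter rollover period means the amount of data kept beyond the retention period is smaller, at the cost of more indices.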



Yes, but again, what Rollover does is delete the existing index or move it to the next phase based on a condition, and create a new index as the write index, right? That way, no single index would hold, say, the past 30 days of data at any given time.

For example, let's say this is the policy below, with the flow of indices shown in the next image.

With rollover you have a set of indices that hold the data covering the retention period, and you query all of them at the same time through an index pattern or alias. Once the oldest index only contains data that is beyond the retention period, it will be deleted. This means that at any point in time you may have a bit more data available than your retention period specifies, but that is generally not a problem.
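To make this concrete, here is a small simulation under a deliberately simplified model (an assumption, not how ILM is implemented): a new index is created every `rollover_days`, and each index is deleted `retention_days` after it was rolled over. Time is measured in days from an arbitrary start, and the function name is hypothetical.

```python
def surviving_indices(now: float, rollover_days: float, retention_days: float):
    """Return (start, end) time ranges of the indices that still exist at `now`."""
    indices = []
    start = 0.0
    while start <= now:
        end = start + rollover_days        # moment this index is rolled over
        deleted_at = end + retention_days  # retention is counted from rollover
        if now < deleted_at:
            # The current write index only holds data up to `now`.
            indices.append((start, min(end, now)))
        start = end
    return indices


# Daily rollover, 2 days of retention, looked at halfway through day 10.
now = 10.5
indices = surviving_indices(now, rollover_days=1, retention_days=2)
oldest_data_age = now - indices[0][0]
print(indices)          # [(8.0, 9.0), (9.0, 10.0), (10.0, 10.5)]
print(oldest_data_age)  # 2.5
```

In this model the oldest queryable document is always between `retention_days` and `retention_days + rollover_days` old, which is the "bit more data" mentioned above.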

I do not see what the issue is. This is how most people manage retention in Elasticsearch.

So let's consider the policy and the flow of indices through ILM (seen in the 2 images I sent in my previous post).
Let's assume I want the logs index to always keep the last 2 days of data (yesterday and today). Just like on Amazon, where at any time you can see the orders you placed in the last 3 months.

The hot phase has a max age of 2 days, after which the index rolls over and moves to the warm phase.

So on day 1 and day 2, if I need to view the last 2 days of data, I have only 1 index/data stream under the alias name "logs_index", for example.

On day 3, when rollover happens and a new index is created, the older one goes to the warm phase.
Now if I need to view the last 2 days of data, I would have to query the new index and some part of the old index.

Similarly, on day 7 I would have to query index no. 4 and index no. 3, and so on, based on the policy in the image in my previous comment.

So how would Elasticsearch know which physical indices to query to get exactly the last 2 days of data based on timestamps?

If you are viewing your data through Kibana and set the time picker to last 2 days, Kibana will query all the indices backing the data stream/index pattern with a time filter added to the query. This means data will in practice only be returned from the last 2 indices. Querying a set of indices that does not hold any relevant data based on the time range is very fast so it is not a performance issue.

No, we don't use Kibana in production. It's done through the Java Elasticsearch REST client.

OK, then you query the full index pattern and add a timestamp range clause to your queries to filter out the correct data.
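As a minimal sketch of that timestamp range clause, the snippet below builds a search body as a Python dict. The same JSON can be sent from any client (including the Java REST client) as the body of `POST <index-pattern>/_search`. The function name and the default `@timestamp` field are assumptions for illustration.

```python
import json


def last_n_days_query(days: int, timestamp_field: str = "@timestamp") -> dict:
    """Build a search body that only matches documents from the last `days` days."""
    return {
        "query": {
            "bool": {
                # A filter clause restricts results without affecting scoring.
                "filter": [
                    {
                        "range": {
                            timestamp_field: {
                                "gte": f"now-{days}d",  # date math relative to query time
                                "lte": "now",
                            }
                        }
                    }
                ]
            }
        }
    }


body = last_n_days_query(2)
print(json.dumps(body, indent=2))
```

Because the filter is evaluated at query time, the result always covers a sliding window of the last 2 days, regardless of how the underlying indices have rolled over.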

Do you mean I cannot query using the alias name, because under the alias there would be only 1 active/write index in the hot phase and only that would be queried? So in other words, an alias can only query the index in the hot phase?

Hence, to query the other indices as well, I would need to use the regex name?
