What is the definitive way of retaining only 7 days of logs

Our HAProxy logs are eating up a large amount of space in our Logstash indices. What is the "correct" / "best" way of setting up Logstash to retain only 7 days of logs in Elasticsearch?

A step-by-step guide would be much appreciated.

From what I can tell, I need to define a custom template, delete the old one, and set Logstash to use the new template.

Hi,

I would create a cron job that calls the timestamp range query below to retain only the last x days of logs.

POST /index/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-7d"
      }
    }
  }
}
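
For illustration, a nightly crontab entry that runs this query with curl might look like the following sketch. The host, index name, and schedule are placeholders you would adjust:

# Every night at 02:00, delete documents older than 7 days (host and index are assumptions)
0 2 * * * curl -s -XPOST 'http://localhost:9200/index/_delete_by_query' -H 'Content-Type: application/json' -d '{"query":{"range":{"@timestamp":{"lt":"now-7d"}}}}'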

Hope this helps.

I would not recommend a delete-by-query operation to prune time-series data. It's awfully hard on your segments and disk I/O, especially if you're indexing heavily while you're trying to run the delete-by-query.

Logstash, by default, will create one index per day, named logstash-XXXX.YY.ZZ, where XXXX is the year, YY is the month, and ZZ is the day. Using Elasticsearch Curator, you can easily delete time-series indices:

curator --host elasticsearch.host.tld delete indices --older-than 7 --time-unit days --timestring '%Y.%m.%d'

There are many more operations which Curator makes easy!

Interesting, so you would put the Curator rule in cron and just run it daily?

What about the TTL stuff? Is it better to avoid it and stick to the default template?

Interesting, so you would put the Curator rule in cron and just run it daily?

Yes.
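
For example, a daily crontab entry might look like this sketch (the Curator path, host, and schedule are assumptions; note that % is special in crontab and must be escaped as \%):

# Run Curator at 01:00 every day to drop indices older than 7 days
0 1 * * * /usr/local/bin/curator --host elasticsearch.host.tld delete indices --older-than 7 --time-unit days --timestring '\%Y.\%m.\%d'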

What about the TTL stuff? Is it better to avoid it and stick to the default template?

Oh, yes.

TTLs at first seem like a good idea. "Oh! I can just set this up and it will auto-prune when it hits the pre-defined TTL." The reality is that while this works, it is a Really Bad Idea™ with time-series data, where you know it will always expire in a predictable way.

TTLs force Elasticsearch to check every single document every 60 seconds (an editable default, but the principle remains). If I index 1,000,000,000 records per day with a 1-day TTL, then I have as many as 1,000,000,000 document TTLs being checked every 60 seconds. You can imagine the strain that puts on the disk subsystem, not to mention the hit to query performance. On top of this, a TTL-expired document is not immediately deleted. It is marked for deletion (yep, another I/O operation), and the actual delete happens at the next segment merge. Segment merges will, of necessity, be very frequent because of TTLs, which adds to the disk I/O strain. Even if I configure the TTL check to be less frequent (hourly, or even daily), I still end up with 1,000,000,000 "mark for deletion" operations, followed immediately by a kajillion segment merges. Oh, and you don't get to choose when the first TTL check happens, so it could land during peak hours.
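
For reference, on the old Elasticsearch 1.x releases that supported TTLs (the feature has since been removed), enabling them looked roughly like this; the index and type names are placeholders, and the 60-second purge interval mentioned above is the indices.ttl.interval setting:

PUT /index/_mapping/logs
{
  "logs": {
    "_ttl": { "enabled": true, "default": "1d" }
  }
}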

On the other hand, deleting an entire index at once with the index delete API (which is what Curator uses) eliminates every document in a few seconds, because it deletes at the index level, with no more segment merges or disk I/O pain than that.
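
As a concrete (hypothetical) example, dropping one day's Logstash index with the index delete API is a single call; the host and index name here are assumptions:

# Drop an entire daily index in one operation
curl -XDELETE 'http://localhost:9200/logstash-2015.03.01'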

If you were to compare these two models to SQL commands, the first (TTLs) would be like:

DELETE FROM tablename WHERE timestamp < NOW() - INTERVAL '24 hours';

and the second model would be like:

DROP TABLE tablename;

You can see that the first is going to be millions of atomic operations, while the second just drops the entire table. That's what deleting an index vs. TTLs is like, and why TTLs are a Really Bad Idea™ for time-series data.


😊 What an awesome answer, thank you so much!

I was trying to delete my old logs in the Elasticsearch service on AWS. I am new to this; can I run this cron job on Logstash to automatically delete logs older than 7 days? Does it still work? Is there a better approach?

This is a horrible approach to data management. You should be using time-series indices or the Rollover API rather than attempting a delete-by-query, for the reasons listed above regarding TTLs; the same logic applies if you mentally substitute delete-by-query for TTL. It will be extremely taxing to your system.

I highly recommend looking into the Rollover API as a way to simplify this. With it, you can make your "non-time-series" index into a time-series index for all intents and purposes.
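
A minimal sketch of a rollover call, assuming you write through an alias (hypothetically named logs_write here) pointing at the current index:

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d"
  }
}

When the condition is met, Elasticsearch creates a new index and moves the alias to it, so old indices can then be deleted whole, just like the daily Logstash indices above.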

And ask for help with the Rollover API in a new topic, or search for an existing one, as this is off-topic here.