Deleting documents by date

Hello all, from my reading on this, the usual advice is to use time-based indices and delete whole indices rather than individual documents, for efficiency. Apart from efficiency, are there any other reasons I shouldn't delete documents with a time-range query instead of just deleting indices?

Thanks for any help.

It's a lot like comparing a SQL DELETE FROM table WHERE date < YYYY.MM.dd to a SQL DROP TABLE. The first is potentially millions or billions of row-level operations, while the second is a single operation. Which is more performant?

In terms of operational efficiency, you end up needing a lot of extra read/write operations to actually delete things in a Lucene index. First, you have to find what to delete (read operation). Then you need to mark the documents for deletion (write operation). Then, at the next segment merge, Lucene has to find the documents marked for deletion (read operation), and make new segments without those documents (write operation).
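To make those four steps concrete, here is a toy model in Python. It is only a sketch and not how Lucene is implemented: the Segment class, the function names, and the full scan used to find matches are all assumptions for illustration (Lucene finds matches through its index structures, not by scanning every document). It only counts how many reads and writes each phase needs.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a Lucene segment."""
    docs: dict                                  # doc_id -> timestamp
    deleted: set = field(default_factory=set)   # ids marked as deleted


def delete_older_than(segments, cutoff):
    """Steps 1 and 2: find matching docs (read), then mark them (write)."""
    reads = writes = 0
    for seg in segments:
        for doc_id, ts in seg.docs.items():
            reads += 1
            if ts < cutoff and doc_id not in seg.deleted:
                seg.deleted.add(doc_id)   # marked only, still stored
                writes += 1
    return reads, writes


def merge(segments):
    """Steps 3 and 4: re-read every doc, copy the live ones to a new segment."""
    reads = writes = 0
    live = {}
    for seg in segments:
        for doc_id, ts in seg.docs.items():
            reads += 1
            if doc_id not in seg.deleted:
                live[doc_id] = ts
                writes += 1
    return Segment(docs=live), reads, writes


segments = [Segment({1: 10, 2: 20}), Segment({3: 30, 4: 40})]
print(delete_older_than(segments, cutoff=25))    # (4, 2)
print(sum(len(s.docs) for s in segments))        # 4, nothing freed yet
merged, reads, writes = merge(segments)
print(reads, writes, merged.docs)                # 4 2 {3: 30, 4: 40}
```

Note that in this model the space held by the two deleted documents is only reclaimed once merge() runs, which is the point of the steps above.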

But operational efficiency is only part of it. Segment fragmentation will be your real enemy over time, robbing you of valuable storage efficiency. See http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html for a brilliant view into what happens with segment deletes (which come with document deletes) and how they interrupt the normal, efficient, tiered segments of a Lucene index.

This is why we don't recommend delete_by_query as a solution for time-series data.
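For comparison, here is a rough sketch of what the two approaches look like as Elasticsearch REST requests. It only builds the requests and does not send them. The index names, the @timestamp field, and the 90-day cutoff are assumptions for illustration, so adjust them to your own setup.

```python
import json


def delete_by_query_request(index, field, cutoff):
    """Remove individual documents whose `field` is older than `cutoff`."""
    body = {"query": {"range": {field: {"lt": cutoff}}}}
    return "POST", f"/{index}/_delete_by_query", json.dumps(body)


def delete_index_request(index):
    """Drop a whole index in one operation (no request body needed)."""
    return "DELETE", f"/{index}", None


# Hypothetical names: a yearly index versus one daily index.
print(delete_by_query_request("winlogbeat-2017", "@timestamp", "now-90d"))
print(delete_index_request("winlogbeat-2017.01.01"))
```

The first request makes Elasticsearch find and mark every matching document. The second just removes the index, which is the DROP TABLE side of the analogy.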

Thanks very much for getting back to me. We're just using the Elastic Stack for Windows event log collection, with probably no more than 2 people running queries against it at any one time, so automatically deleting documents overnight (as inefficient as it is) probably wouldn't be a concern in our situation. However, as you've pointed out, it may affect storage over time, so I'll delete by index instead. I'm using a yearly index at the moment, so I'll have to change the index pattern :)

Thanks for your help.
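Once the indices are daily rather than yearly, picking the ones to delete becomes a matter of parsing the date out of the index name. A minimal sketch, assuming a hypothetical winlogbeat- prefix and the YYYY.MM.dd suffix mentioned earlier in the thread (both are assumptions about the setup, not something confirmed here):

```python
from datetime import date, datetime, timedelta


def expired_indices(names, prefix, today, keep_days, fmt="%Y.%m.%d"):
    """Return the dated indices that are older than `keep_days` days."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue                      # some other index, e.g. .kibana
        try:
            day = datetime.strptime(name[len(prefix):], fmt).date()
        except ValueError:
            continue                      # suffix is not a daily date
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


names = [
    "winlogbeat-2017.01.01",
    "winlogbeat-2017.03.30",
    "winlogbeat-2017",        # old yearly index, skipped by the parser
    ".kibana",
]
print(expired_indices(names, "winlogbeat-", date(2017, 4, 1), keep_days=30))
# ['winlogbeat-2017.01.01']
```

With a yearly index there is nothing smaller than a whole year to drop, which is why the index pattern has to change before deleting by index is useful.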

Why not use Elasticsearch Curator (which has 2 Windows installation options) with the Rollover action? You can keep data for longer, and create fewer indices, by keeping rollover frequency low (usually by document count). Or you can roll over by age. It's up to you, but Curator can make an otherwise tedious index deletion ritual something you don't have to worry about as much.
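As far as I understand it, Curator's Rollover action drives Elasticsearch's _rollover API, so the conditions it takes map onto a request like the one sketched below. This is not Curator's own configuration (that lives in YAML action files, not shown here), and the alias name and thresholds are made up for illustration.

```python
import json


def rollover_request(alias, max_age=None, max_docs=None):
    """Build a _rollover request with an age and/or document-count condition."""
    conditions = {}
    if max_age is not None:
        conditions["max_age"] = max_age      # e.g. "30d"
    if max_docs is not None:
        conditions["max_docs"] = max_docs    # e.g. 50_000_000
    if not conditions:
        raise ValueError("give at least one of max_age or max_docs")
    return "POST", f"/{alias}/_rollover", json.dumps({"conditions": conditions})


# Hypothetical alias and limits: roll over on whichever condition is met first.
print(rollover_request("winlogbeat-write", max_age="30d", max_docs=50_000_000))
```

Elasticsearch rolls the alias over to a new index when any listed condition is met, so a high document count with a generous age limit keeps the number of indices low.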

Excellent - I did briefly look at Curator but didn't see the Rollover option; either way I'll use Curator to delete older data.

Thanks once again.
