Deleting documents by date

kernelpanic · April 13, 2017, 10:51am

Hello all, in my reading on this it does appear that the advice is to use time-based indices and delete the index rather than individual documents due to efficiency. Apart from efficiency concerns, are there any other reasons I shouldn't delete documents using a time range query instead of just deleting indices?

Thanks for any help.

theuntergeek · April 13, 2017, 11:11am

It's a lot like comparing a SQL DELETE FROM table WHERE date < 'YYYY.MM.dd' to a SQL DROP TABLE. The first is potentially millions or billions of individual row operations, where the second is a single operation. Which is more performant?

In terms of operational efficiency, you end up needing a lot of extra read/write operations to actually delete things in a Lucene index. First, you have to find what to delete (read operation). Then you need to mark the documents for deletion (write operation). Then, at the next segment merge, Lucene has to find the documents marked for deletion (read operation), and make new segments without those documents (write operation).

It's not merely operational efficiency, though: segment fragmentation will be your enemy over time, robbing you of valuable storage efficiency. See http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html for a brilliant view into what happens with segment deletes (which come with document deletes) and how they interrupt the normal, efficient, tiered segment structure of a Lucene index.

This is why we don't recommend delete_by_query as a solution for time-series data.
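To make the comparison concrete, here is a minimal Python sketch of the two requests being contrasted. It only builds the HTTP method, path, and body and sends nothing. The index names (eventlogs-2017, eventlogs-2016.12) and the @timestamp field are assumptions for illustration, not names from this thread.

```python
def delete_by_query_request(index, timestamp_field, cutoff):
    """Build a delete-by-query request for documents older than `cutoff`.

    Elasticsearch must find every matching document, mark it deleted,
    and later merge the affected segments to reclaim the space.
    """
    body = {"query": {"range": {timestamp_field: {"lt": cutoff}}}}
    return "POST", f"/{index}/_delete_by_query", body


def delete_index_request(index):
    """Build a delete-index request: one operation that drops the whole index."""
    return "DELETE", f"/{index}", None


if __name__ == "__main__":
    # Hypothetical index and field names, for illustration only.
    print(delete_by_query_request("eventlogs-2017", "@timestamp", "2017-01-01"))
    print(delete_index_request("eventlogs-2016.12"))
```

The first request carries a query that has to be evaluated against every segment. The second names an index and nothing else, which is why it stays cheap however many documents the index holds.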

kernelpanic · April 13, 2017, 11:21am

Thanks very much for getting back to me. We're just using the Elastic Stack for Windows event log collection, with probably no more than 2 people running queries against it at any one time, so automatically deleting documents overnight (as inefficient as it is) probably wouldn't be a concern in our situation. However, as you've pointed out, it may affect storage over time, so I'll delete whole indices instead. I'm using a yearly index at the moment, so I'll have to change the index pattern.

Thanks for your help.
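As a rough illustration of why a finer-grained index pattern helps here: once each index name carries a date, choosing what to delete becomes a matter of parsing names, with no query against the documents. The sketch below assumes a hypothetical daily pattern of the form eventlogs-YYYY.MM.dd; the prefix and the retention period are placeholders.

```python
from datetime import date, datetime, timedelta


def expired_indices(index_names, prefix, today, keep_days):
    """Return the names of daily indices whose date is older than the retention window.

    Names that do not start with `prefix`, or whose suffix is not a
    YYYY.MM.dd date, are ignored rather than treated as expired.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # e.g. a yearly index such as "eventlogs-2017"
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["eventlogs-2017.01.01", "eventlogs-2017.04.12", "eventlogs-2017"]
    print(expired_indices(names, "eventlogs-", date(2017, 4, 13), keep_days=30))
```

With a single yearly index there is nothing to select until the whole year has aged out, which is why the pattern has to change before index-level deletion becomes useful.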

theuntergeek · April 13, 2017, 11:41am

Why not use Elasticsearch Curator (which has 2 Windows installation options) with the Rollover action? You can keep data for longer, and create fewer indices, by keeping rollover frequency low (usually by document count). Or roll over by age. It's up to you, but Curator can make an otherwise tedious index deletion ritual something you don't have to worry about as much.
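For reference, rolling over by document count or by age corresponds to the max_docs and max_age conditions of the Elasticsearch rollover API, which, as far as I understand it, is what Curator's Rollover action calls. The Python sketch below only builds such a request and sends nothing. The alias name and thresholds are hypothetical, and this is not Curator's own configuration format.

```python
def rollover_request(alias, max_age=None, max_docs=None):
    """Build a rollover request for `alias` with at least one condition.

    Elasticsearch creates a new index behind the alias when any of the
    supplied conditions is met by the current write index.
    """
    conditions = {}
    if max_age is not None:
        conditions["max_age"] = max_age
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    if not conditions:
        raise ValueError("supply max_age, max_docs, or both")
    return "POST", f"/{alias}/_rollover", {"conditions": conditions}


if __name__ == "__main__":
    # Hypothetical alias and thresholds, for illustration only.
    print(rollover_request("eventlogs-write", max_age="30d", max_docs=50_000_000))
```

A higher document-count threshold means fewer, larger indices, at the cost of coarser retention: each deletion then removes a bigger slice of history at once.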

kernelpanic · April 13, 2017, 11:54am

Excellent - I did briefly look at Curator but didn't see the Rollover option; either way I'll use Curator to delete older data.

Thanks once again.

