Delete 7-day-old data every day


(Ram Nathan) #1

Hi

I want to delete data older than 7 days from my index every day. I am not using time-based indices. In that scenario, will Curator help? I would prefer to write a Java scheduler; can I use Curator from Java code, or is it Python only?


(David Pilato) #2

That will be very inefficient, but you can call the delete-by-query API from a script in your crontab, or from any Java app you want.

Using daily indices and dropping them every 7 days will be much, much more efficient.
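For illustration, a minimal sketch of such a delete-by-query call. The index name `my-index` and the date field `@timestamp` are placeholders, not anything from this thread; the code only builds the request, and the commented lines show how the low-level REST client that ships with 5.6 would send it:

```java
public class PurgeOldDocs {

    // Builds the JSON body for POST /<index>/_delete_by_query that removes
    // every document whose date field is older than the given number of days.
    static String deleteByQueryBody(String dateField, int days) {
        return "{\"query\":{\"range\":{\"" + dateField + "\":"
             + "{\"lt\":\"now-" + days + "d/d\"}}}}";
    }

    public static void main(String[] args) {
        String body = deleteByQueryBody("@timestamp", 7); // field name is an assumption
        System.out.println("POST /my-index/_delete_by_query");
        System.out.println(body);
        // With the 5.6 low-level REST client this would be sent roughly as:
        //   RestClient client = RestClient.builder(
        //           new HttpHost("localhost", 9200, "http")).build();
        //   client.performRequest("POST", "/my-index/_delete_by_query",
        //           Collections.emptyMap(),
        //           new NStringEntity(body, ContentType.APPLICATION_JSON));
    }
}
```

A cron job or Java scheduler would simply run this once a day.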


(Ram Nathan) #3

In that case, here is my scenario: I create two indices during installation, and all data is loaded into them. The first index needs to purge data older than 7 days. The purge time frame for the second index depends on user input: every type in the second index has a field that specifies how many days to retain. How can I use time-based indices for the second index?


(David Pilato) #4

Every type in the second index will have a field that specifies the days to retain.

Instead of sending the data to index-foo, send it to index-foo-<date-to-retain>, like index-foo-2018-11-20.

On 2018-11-20, drop all indices named index-foo-2018-11-20, for example.
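The naming scheme above can be sketched in Java; the prefix `index-foo` and the dates are just the examples from this post:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class RetentionIndexName {

    // Encode the drop date into the index name: data ingested on
    // ingestDate with a retention of daysToRetain goes into the index
    // that will be deleted wholesale on ingestDate + daysToRetain.
    static String indexNameForRetention(String prefix, LocalDate ingestDate,
                                        int daysToRetain) {
        LocalDate dropDate = ingestDate.plusDays(daysToRetain);
        return prefix + "-" + dropDate.format(DateTimeFormatter.ISO_LOCAL_DATE);
    }

    public static void main(String[] args) {
        System.out.println(
            indexNameForRetention("index-foo", LocalDate.of(2018, 11, 13), 7));
        // prints index-foo-2018-11-20
    }
}
```

A daily job then only has to delete every index whose name carries today's date, which is a cheap metadata operation compared to delete-by-query.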


(Ram Nathan) #5

In case of reindexing/migration, how do I migrate all the data? Should I migrate data from all indices?


(David Pilato) #6

You can probably use the reindex API.
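For reference, a reindex call is a single POST /_reindex request; the index names below are placeholders:

```java
public class ReindexSketch {

    // Body for POST /_reindex copying all documents from one index
    // into another.
    static String reindexBody(String sourceIndex, String destIndex) {
        return "{\"source\":{\"index\":\"" + sourceIndex + "\"},"
             + "\"dest\":{\"index\":\"" + destIndex + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println("POST /_reindex");
        System.out.println(reindexBody("old-index", "new-index"));
    }
}
```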


(Ram Nathan) #7

Can I use an alias with time-based indices? Basically I need a fixed name for my index to refer to from Java/Kibana.


(David Pilato) #8

Yes you can.
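To illustrate the idea: a fixed alias can be pointed at whichever concrete time-based index is current, via POST /_aliases. The index and alias names below reuse the `twitter-2018-11-13` example from later in this thread:

```java
public class AliasSketch {

    // Body for POST /_aliases: point a fixed alias at a concrete
    // time-based index, so Java/Kibana can always use the same name.
    static String addAliasBody(String index, String alias) {
        return "{\"actions\":[{\"add\":"
             + "{\"index\":\"" + index + "\",\"alias\":\"" + alias + "\"}}]}";
    }

    public static void main(String[] args) {
        System.out.println("POST /_aliases");
        System.out.println(addAliasBody("twitter-2018-11-13", "twitter"));
    }
}
```

Searching against the alias `twitter` then covers every index it points to, however many daily indices exist.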


(Ram Nathan) #9

Can you please point me to documentation for creating time-based indices using the Java API (Elasticsearch 5.6.3)?


(David Pilato) #10

Not really. It's just a question of index naming...

So using https://www.elastic.co/guide/en/elasticsearch/client/java-rest/current/java-rest-high-create-index.html

CreateIndexRequest request = new CreateIndexRequest("twitter-2018-11-13");

Just make the index name based on the current time,
and use index templates (which is not something you have to do via the Java API).
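An index template applies shared settings and aliases to every matching time-based index as it is created. A sketch of the request body, using 5.6 syntax (where the pattern key is `template`; 6.x renamed it to `index_patterns`); the pattern and alias names are just the `twitter` example from above:

```java
public class TemplateSketch {

    // Body for PUT /_template/twitter-template (Elasticsearch 5.6 syntax).
    // Every new index matching the pattern automatically gets the alias.
    static String templateBody(String pattern, String alias) {
        return "{\"template\":\"" + pattern + "\","
             + "\"aliases\":{\"" + alias + "\":{}}}";
    }

    public static void main(String[] args) {
        System.out.println("PUT /_template/twitter-template");
        System.out.println(templateBody("twitter-*", "twitter"));
    }
}
```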


(Tek Chand) #11

@RAM_NATHAN, as far as I know you need to create one index per day, with the date in the index name. Then you can use Curator to delete old indices very efficiently.

I am already using Curator to automatically delete indices older than 20 days from Elasticsearch.
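For reference, a Curator 5.x action file along those lines might look like this; the `index-foo-` prefix and the `timestring` are assumptions matching the naming scheme discussed above, not this poster's actual config:

```yaml
actions:
  1:
    action: delete_indices
    description: Delete indices older than 20 days (based on index name)
    options:
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: index-foo-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 20
```

Curator itself is a Python tool, but it is driven entirely by YAML files like this, so a Java scheduler (or cron) only needs to invoke the `curator` command.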

Thanks.


(Oleksandr Gavenko) #12

The delete-by-query API just marks records as deleted.

It doesn't reduce the index size or data size.

You can reindex your data, though that is an unusual usage pattern for ES.

You've made a problem out of nothing ))


(Ram Nathan) #13

Sorry, I was not clear. Our product is in production with Kibana 1.x, and we are now migrating to 5.6.x. Creating time-based indices would need impact analysis and other checks, so for now we thought we would stick with the delete-by-query API. But if it doesn't actually delete anything, I don't understand why the API exists.


(Oleksandr Gavenko) #14

It is here (the delete API) because people need to delete some documents, so they no longer appear in search responses.

If you search across a product line, you don't have another option: you need to delete and reindex.

If you work with time series, that is cumbersome, and you should go with time-based indices and drop the old, unused indices.


(Ram Nathan) #15

Thanks for the reply. What's the solution then? I need to purge data once in a while.


(Ram Nathan) #16

Can you please reply? Is using time-based indices the only option for deleting the data?


(Christian Dahlqvist) #17

When you use delete-by-query to manage retention, you require a lot more processing than if you use time-based indices. As you typically delete the oldest data, it tends to be located in the largest and oldest segments. The data is only actually removed from disk once those segments are merged, and this can take time, as a lot of the data in those segments needs to be deleted before they become candidates for merging. You can get around this by explicitly issuing a force-merge after the delete, but that is also quite an expensive operation.

The conclusion is that by not using time-based indices you incur a lot of expensive extra processing that will prevent you from getting the most out of your cluster.
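The force-merge mentioned above is a plain REST call; a sketch of the request it would issue, with `my-index` as a placeholder name. The `only_expunge_deletes` parameter restricts the merge to segments containing deleted documents:

```java
import java.util.Collections;
import java.util.Map;

public class ForceMergeSketch {

    // Endpoint for POST /<index>/_forcemerge, issued after a large
    // delete-by-query to actually reclaim disk space.
    static String forceMergeEndpoint(String index) {
        return "/" + index + "/_forcemerge";
    }

    public static void main(String[] args) {
        Map<String, String> params =
                Collections.singletonMap("only_expunge_deletes", "true");
        System.out.println("POST " + forceMergeEndpoint("my-index")
                + " params=" + params);
    }
}
```

As the post says, this is itself I/O-heavy, which is why dropping whole time-based indices remains the cheaper option.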


(Ram Nathan) #18

Thanks. I misunderstood the earlier comments as saying that delete-by-query does not delete the data at all. So I now take it that delete-by-query does delete the data, but it requires a lot of processing and can ultimately cause performance issues.


(Christian Dahlqvist) #19

It deletes the data, but that does not necessarily mean that disk space will be freed up immediately.