Data expiration and TTL


(YG) #1

We have a stream of data coming in every second, but we only want to keep data from the past T (say, 1 hour). What is the best way to expire and remove old data? We did some research and found the following two options:

  1. Set the ttl of each document to T, and ES will automatically blacklist old documents and remove them. One question we have: when and how frequently is the data physically removed? Is that controlled by indices.ttl.interval or by something else?

  2. Use time-based indices and create a new index every T. However, this approach might produce very strange tf-idf scores for the latest index while it holds very few documents. Is there a good way to handle this?

Thanks!


(Mark Walkom) #2

TTL is deprecated and will be removed in upcoming versions.

So you should definitely use time based indices instead.
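
As a rough illustration of the time-based approach, the sketch below names one index per hour and picks out the indices whose whole hour has aged past the retention window. The `events` prefix, the hourly bucket size, and both helper functions are assumptions made for this example and are not part of Elasticsearch.

```python
from datetime import datetime, timedelta, timezone


def index_name(ts: datetime, prefix: str = "events") -> str:
    """Name of the hourly index for a timezone-aware timestamp (hypothetical scheme)."""
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d.%H}"


def expired_indices(existing, now, keep_hours=1, prefix="events"):
    """Return the index names whose entire hourly bucket ended before the cutoff."""
    cutoff = now - timedelta(hours=keep_hours)
    expired = []
    for name in existing:
        # Parse the bucket start back out of the index name.
        stamp = name[len(prefix) + 1:]
        bucket_start = datetime.strptime(stamp, "%Y.%m.%d.%H").replace(tzinfo=timezone.utc)
        if bucket_start + timedelta(hours=1) <= cutoff:
            expired.append(name)
    return expired


now = datetime(2016, 5, 1, 12, 30, tzinfo=timezone.utc)
existing = [
    "events-2016.05.01.09",
    "events-2016.05.01.10",
    "events-2016.05.01.11",
    "events-2016.05.01.12",
]
print(index_name(now))
print(expired_indices(existing, now))
```

Each expired index can then be dropped as a whole, which avoids marking individual documents as deleted. With hourly buckets the newest index and part of the previous one are both live, so a search that must see exactly the last T would still need a range filter on the timestamp.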


(YG) #3

Thanks for the reply. However, are there any helper functions for time-based indices? We are concerned about a new index with very few documents in it. It might have very different tf-idf values and produce strange search results. Is there a good way to handle this?

Thanks!
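
To make the scoring concern concrete, here is a small sketch that uses the inverse document frequency formula from Lucene's classic TF-IDF similarity, idf = 1 + ln(numDocs / (docFreq + 1)). The document counts are made-up numbers for illustration, and newer versions that default to BM25 use a different formula, although it also depends on per-shard document counts.

```python
import math


def classic_idf(num_docs: int, doc_freq: int) -> float:
    """IDF as in Lucene's classic TF-IDF similarity: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts for the same term in two hourly indices:
# one that is full, and one that has only just been created.
full_index = classic_idf(num_docs=3_600_000, doc_freq=36_000)
fresh_index = classic_idf(num_docs=50, doc_freq=3)

print(f"full index idf:  {full_index:.3f}")
print(f"fresh index idf: {fresh_index:.3f}")
```

Because the statistics are computed per index (strictly, per shard), the same term gets a noticeably different weight in the nearly empty index, which is the effect described above.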


(Mark Walkom) #4

Not really. It might be worth looking into just deleting documents directly?
Or perhaps someone else has other ideas.
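
For reference, a direct delete of old documents could look roughly like the sketch below, which only builds the request path and body. It assumes a version where the `_delete_by_query` endpoint is available, a date field called `timestamp`, and an index called `events`; the field name, the index name, and the helper function are hypothetical.

```python
import json


def expire_request(index: str, field: str = "timestamp", keep: str = "1h"):
    """Build the path and JSON body for a range-based delete-by-query (sketch only)."""
    path = f"/{index}/_delete_by_query"
    # Date math: match every document whose timestamp is older than now minus `keep`.
    body = {"query": {"range": {field: {"lt": f"now-{keep}"}}}}
    return path, body


path, body = expire_request("events")
print("POST", path)
print(json.dumps(body, indent=2))
```

Running this on a schedule, for example once a minute, keeps the index close to the desired window. The matched documents are still only marked as deleted and are physically removed when segments merge.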


(YG) #5

Yes, directly issuing a request to delete data is also fine. However, what is the difference between ttl and deleting directly? Don't both just blacklist the deleted items and then remove them during segment merging? Is there any particular reason to favor deleting directly, or do we prefer it only because ttl is being deprecated?


(Mark Walkom) #6

TTL means constantly scanning the entire index looking for documents to be deleted, which is expensive.

