We have a stream of data arriving every second, but we only want to keep data from the past T time (say, 1 hour). What is the best way to expire and remove the old data? We did some research and found the following two options:
Set the TTL of each document to T, and ES will automatically mark expired documents and remove them. One question we have: when, and how frequently, is the expired data physically removed? Is this controlled by indices.ttl.interval or by something else?
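For reference, this is roughly the setup we mean (a sketch using the legacy `_ttl` mapping field from the 1.x docs; the `logs`/`event` index and type names are made up):

```shell
# Create an index whose documents expire 1 hour after indexing
# unless an explicit _ttl is supplied per document.
curl -XPUT 'localhost:9200/logs' -d '{
  "mappings": {
    "event": {
      "_ttl": { "enabled": true, "default": "1h" }
    }
  }
}'

# How often the TTL purge process runs is, as far as we can tell,
# the indices.ttl.interval setting in elasticsearch.yml:
#
#   indices.ttl.interval: 60s
```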
Use time-based indices, creating a new index for each T time frame. However, this approach might produce skewed tf-idf scores for the newest index while it still contains very few documents. Is there a good way to handle this?
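To make the second option concrete, here is a sketch of the bookkeeping it would need on our side (the hourly `logs-YYYY.MM.DD.HH` naming scheme and both function names are our own invention, not an ES API; zero-padded names sort lexicographically in chronological order, which is what makes the comparison work):

```python
from datetime import datetime, timedelta

def index_for(ts):
    """Hourly index name for a timestamp, e.g. logs-2015.06.01.13."""
    return ts.strftime("logs-%Y.%m.%d.%H")

def expired_indices(existing, now, retention=timedelta(hours=1)):
    """Indices whose whole hour lies outside the retention window.

    Any index name strictly below the cutoff index can be dropped
    with a single DELETE of the entire index.
    """
    cutoff = index_for(now - retention)
    return sorted(name for name in existing if name < cutoff)
```

Searching would then go against a wildcard like `logs-*` so that results span all live indices, and expiry becomes a cheap whole-index DELETE instead of per-document deletes.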
Thanks for the reply. For time-based indices, are there any helper tools for managing this? We are concerned that a new index with very few documents in it might have very different tf-idf values and produce strange search results. Is there a good way to handle this?
Yes, directly issuing a delete request is also fine. However, what is the difference between TTL and deleting directly? Both only mark documents as deleted and physically remove them during segment merging, right? Is there a particular reason to favor deleting directly, or is it just that TTL is being deprecated, so delete is preferred?
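By "directly delete" we mean something like the following (a sketch; the exact API depends on the ES version — 1.x had delete-by-query built in as shown here, while later versions moved it out of core and 5.x reintroduced it as `_delete_by_query`; the `logs` index and `timestamp` field names are made up):

```shell
# Delete every document older than 1 hour, assuming each document
# carries a "timestamp" field set at indexing time.
curl -XDELETE 'localhost:9200/logs/_query' -d '{
  "query": {
    "range": {
      "timestamp": { "lt": "now-1h" }
    }
  }
}'
```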