I want to manage the lifecycle of Elasticsearch documents over x days. The _delete_by_query API helps with the job, but I understand that deleting a large data set this way has a performance impact on queries/reads, and segment merging adds to it. What is the best way to accomplish this? Would rollover be a better option here? In case I have to maintain the state of a document for 90 days or more, would reading and aggregating across multiple indices over a large data set have no impact?
It’s not time-series data; it is entity state that we have to maintain for 45 days, after which we have to auto-purge the record. We have around half a billion records and we receive updates, or rather upserts. ILM, as I understand it, is for time-series data.
The most efficient way to delete data from Elasticsearch or OpenSearch is to delete complete indices. This does however require the use of time-based indices, which complicate performing updates.
The other option is, as you pointed out, to use delete-by-query. The reason this is more expensive and can cause performance issues is that it deletes individual documents from indices, which is basically an update operation with a tombstone record that requires both a read and a write.
If the purge date/time is based on the creation date of the document and you have access to this date/time outside Elasticsearch when you perform the insert/update, you may be able to use the older style of time-based indices, where each index covers a specific, fixed time period that is indicated by the index name. When you index a document you would determine the name of the index to write to based on this static timestamp. You would then do the same whenever you update the document.
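To make the index-naming idea concrete, here is a minimal sketch in Python. The `entity-state` prefix, the daily granularity and the function names are my own assumptions for illustration, not anything Elasticsearch prescribes; with your volumes weekly indices might be a better fit. It only computes names; the actual write, and the `DELETE /<index>` call for each expired index, would go through whatever client you already use.

```python
from datetime import date, datetime, timedelta, timezone

INDEX_PREFIX = "entity-state"  # hypothetical prefix, pick your own
RETENTION_DAYS = 45            # retention period from the question


def index_for(created_at: datetime) -> str:
    """Name of the daily index that owns a document created at `created_at`.

    `created_at` should be timezone-aware. Because the creation timestamp
    never changes, inserts and later updates resolve to the same index.
    """
    day = created_at.astimezone(timezone.utc).date()
    return f"{INDEX_PREFIX}-{day:%Y.%m.%d}"


def expired_indices(index_names, today: date, retention_days: int = RETENTION_DAYS):
    """Return the daily indices that fall entirely outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(INDEX_PREFIX + "-"):
            continue  # not one of our time-based indices
        suffix = name[len(INDEX_PREFIX) + 1:]
        try:
            day = datetime.strptime(suffix, "%Y.%m.%d").date()
        except ValueError:
            continue  # name does not end in a date, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```

A daily job could pass the current list of index names to `expired_indices` and delete whatever comes back, which removes whole indices instead of individual documents.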
If the deletion is not based on the creation timestamp, e.g. it is based on the last updated date instead, or you do not have access to this timestamp when updating, you will most likely need to take the hit and rely on delete-by-query, which you will have to call from outside Elasticsearch, e.g. through a script or cron job.
If you need to use delete-by-query, it may be worthwhile to try performing smaller deletes more frequently in order to spread out the load.
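As a rough sketch of what "smaller deletes more frequently" could look like from a script, the Python below splits the expired time range into short windows and builds one delete-by-query request per window. The `last_updated` field, the `entities` index and the helper names are assumptions on my part; `conflicts`, `requests_per_second` and `wait_for_completion` are existing query parameters of the `_delete_by_query` API, but the values shown are placeholders to tune. Sending the requests is left to your HTTP client of choice.

```python
from datetime import datetime, timedelta, timezone


def purge_windows(oldest: datetime, cutoff: datetime, step: timedelta):
    """Split [oldest, cutoff) into consecutive windows of at most `step`."""
    windows = []
    start = oldest
    while start < cutoff:
        end = min(start + step, cutoff)
        windows.append((start, end))
        start = end
    return windows


def delete_by_query_body(field: str, start: datetime, end: datetime) -> dict:
    """Request body matching documents whose `field` lies in [start, end)."""
    return {
        "query": {
            "range": {field: {"gte": start.isoformat(), "lt": end.isoformat()}}
        }
    }


def delete_by_query_path(index: str, requests_per_second: int = 500) -> str:
    """Path and query string for a throttled, asynchronous delete-by-query.

    conflicts=proceed keeps the task going when a document was updated
    between the search phase and the delete phase.
    """
    return (
        f"/{index}/_delete_by_query"
        f"?conflicts=proceed"
        f"&requests_per_second={requests_per_second}"
        f"&wait_for_completion=false"
    )


if __name__ == "__main__":
    # Example: purge everything that crossed the 45-day mark in the last 6 hours,
    # one hour at a time.
    cutoff = datetime.now(timezone.utc) - timedelta(days=45)
    for start, end in purge_windows(cutoff - timedelta(hours=6), cutoff, timedelta(hours=1)):
        print("POST", delete_by_query_path("entities"))
        print(delete_by_query_body("last_updated", start, end))
```

Run from cron every few hours, each invocation then only touches the documents that expired since the previous run.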