Design question regarding document expiration

We started working with ES 2.4 in our project to index the content of an XML cache. This cache has an eviction process (triggered every 30s) that removes expired documents from the index (thousands of them). To avoid using delete-by-query, we created a structure as follows:

  • if the cache time is 1 hour, we have 4 indexes, each one containing 15 min of documents.
  • every 15 min, a new index is created and the oldest one is removed.
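To make the rotation concrete, here is a small sketch of how documents could be routed to 15-minute bucket indexes. The index naming scheme (`cache-YYYYMMDD-HHMM`) and the bucket size constant are assumptions for illustration, not something from the original setup:

```python
from datetime import datetime, timedelta, timezone

BUCKET_MINUTES = 15  # hypothetical bucket size: 4 buckets cover the 1-hour cache

def index_for(ts: datetime) -> str:
    """Name of the 15-minute bucket index a document with timestamp ts goes into."""
    bucket = (ts.minute // BUCKET_MINUTES) * BUCKET_MINUTES
    return ts.strftime(f"cache-%Y%m%d-%H{bucket:02d}")

def expired_index(now: datetime) -> str:
    """The bucket that just aged out of the 1-hour window and can be dropped whole."""
    return index_for(now - timedelta(hours=1))

now = datetime(2018, 9, 1, 10, 37, tzinfo=timezone.utc)
print(index_for(now))       # current write index for 10:30-10:45
print(expired_index(now))   # whole index to delete instead of delete-by-query
```

Dropping `expired_index(now)` removes all its documents in one cheap operation, which is the point of the design.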

We did it this way to avoid the massive document deletion, which could trigger a lot of segment merging; at least that was the recommendation when we talked to you some years ago.

Now with ES 6.4, is it still recommended to keep this design, or can we simply apply delete-by-query to a single index, even if we are removing thousands of documents every 30s?

The basic principle still holds today: deleting a whole index is much cheaper than deleting a subset of the documents from an index. Elasticsearch is still based on Lucene, and Lucene still operates on (mostly) immutable segments, cleaning up deleted documents asynchronously through merges. But, as with all performance questions, the only honest answer is "it depends": the only way to know for sure is to run an experiment with your specific usage pattern.
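For comparison, the two operations look like this against the 6.x REST API. The index name, field name (`expires_at`), and host are hypothetical, just to illustrate the difference:

```shell
# Option 1: drop a whole expired bucket index.
# This just unlinks the index's segments on disk; no per-document work.
curl -X DELETE "localhost:9200/cache-20180901-1015"

# Option 2: delete-by-query on a single long-lived index.
# Each matching document is only marked deleted; the space is reclaimed
# later by segment merges, which is the cost the rolling-index design avoids.
curl -X POST "localhost:9200/cache/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": { "range": { "expires_at": { "lte": "now" } } }
}'
```

If you benchmark option 2, watch merge activity and deleted-document counts (e.g. via `_cat/segments`) while the eviction runs every 30s.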

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.