I have a data stream containing page views for multiple customer websites. Customers can delete their data (this is important): when they do, I remove all documents matching a given hostname.
At the same time, I want searches over recent data to stay fast. What works best for me is 12 primary shards for recent data; older data is fine with 3 shards (both on a 3-node cluster).
The most recent data (up to 1 day old) is also updated with a few extra metrics we collect at the end of a page view (like time on page).
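For context, the customer deletions are currently a delete-by-query against the whole data stream, roughly like this (names like `my-data-stream` and the `hostname` field are just placeholders for my setup):

```
POST /my-data-stream/_delete_by_query
{
  "query": {
    "term": {
      "hostname": "customer-site.example.com"
    }
  }
}
```

This works across all backing indices of the stream, which is exactly why read-only shrunken indices are a problem for me.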
A few options (and their concerns):
- The shrink API, but it produces read-only indices (I need to be able to delete specific old data)
- Make the data stream writable once every month and run the deletions then (is that possible?)
- Reindexing the whole data stream every month to remove the older documents (how would I keep writes and updates working well during the reindex?)
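To illustrate the first option, this is roughly how I understand the shrink flow would look on an older backing index (index names are placeholders; the write block on the source is required before shrinking, and as far as I can tell it is copied to the target unless cleared):

```
# Block writes on the source index (required for shrink)
PUT /my-old-backing-index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

# Shrink from 12 primaries down to 3, clearing the write block on the target
POST /my-old-backing-index/_shrink/my-old-backing-index-shrunk
{
  "settings": {
    "index.number_of_shards": 3,
    "index.blocks.write": null
  }
}
```

If clearing `index.blocks.write` on the shrink target really does leave it writable for delete-by-query, maybe option 1 is less of a dead end than I assumed?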
What would be a good way to keep recent data fast in a data stream that sometimes gets updates or deletes in very old data?
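For completeness, the "extra metrics" updates on recent data are an update-by-query along these lines (the `@timestamp` range and `time_on_page` field reflect my mapping; values here are illustrative):

```
POST /my-data-stream/_update_by_query
{
  "query": {
    "range": {
      "@timestamp": { "gte": "now-1d" }
    }
  },
  "script": {
    "source": "ctx._source.time_on_page = params.top",
    "params": { "top": 42 }
  }
}
```

So whatever layout I end up with needs to support this on the newest backing index as well.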