I've been looking into how best to model data whilst accounting for customisable data retention periods at a user / document level rather than for the whole index.
ILM seems perfect when reasoning about a whole index, and old versions of Elasticsearch appear to have had TTLs, but there doesn't appear to be any guidance or best practice on how to approach scenarios that need finer-grained policies.
The options appear to be:
- Create an index per user + time window. This gives the most control but could end up with lots of small indexes.
- Group users with the same retention period into an index per retention period + time window. This may be less flexible when a user's retention period changes, and could still end up with very small or very large indexes over time.
- Write everything to ordinary indexes per time window and run a separate cleanup process to find and remove the relevant data once it has expired.
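To make the third option more concrete, here is a minimal sketch of what I have in mind, assuming each document is stamped with an absolute expiry at ingest. The field names `expires_at` and `user_id` and both helper functions are hypothetical; the query body is the kind of thing that would be sent to the `_delete_by_query` endpoint against the time-window indexes.

```python
from datetime import datetime, timedelta, timezone


def with_expiry(doc, retention_days, now=None):
    """Return a copy of doc stamped with its creation time and an absolute expiry.

    `expires_at` is a hypothetical field name; retention_days would come from
    wherever the per-user retention setting is stored.
    """
    created = now or datetime.now(timezone.utc)
    expires = created + timedelta(days=retention_days)
    return {
        **doc,
        "@timestamp": created.isoformat(),
        "expires_at": expires.isoformat(),
    }


def expired_docs_query():
    """Body for POST <index-pattern>/_delete_by_query matching every expired doc."""
    # "now" is Elasticsearch date math, resolved on the server at query time.
    return {"query": {"range": {"expires_at": {"lt": "now"}}}}


if __name__ == "__main__":
    doc = with_expiry({"user_id": "u1", "message": "hello"}, retention_days=30)
    print(doc)
    print(expired_docs_query())
```

The catch with stamping the expiry at ingest is that changing a user's retention period later means rewriting `expires_at` on their existing documents, so this only sketches one way of doing it.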
Any insight on how others are approaching this, or other options I may not have considered, would be appreciated.
As for the data itself, assume a time series of user-generated data where each user can have a custom retention period defined in days, up to a year. The number of users is growing and is in the tens of thousands, some users inevitably generate more data than others, and the overall total is a few million records a day.
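For a rough sense of scale, here is some back-of-the-envelope arithmetic comparing the first two options. Every number in it is an assumption picked to match the description above (10,000 users, 3 million records a day, 30-day windows, 5 distinct retention periods, and the worst case where data is kept for the full year), so treat it as illustrative only.

```python
# All figures below are assumptions for illustration, not measurements.
USERS = 10_000               # "tens of thousands" of users, low end
RECORDS_PER_DAY = 3_000_000  # "a few million records a day"
WINDOW_DAYS = 30             # assumed size of each time window
MAX_RETENTION_DAYS = 365     # retention of up to a year
RETENTION_TIERS = 5          # assumed number of distinct retention periods


def live_windows(retention_days, window_days):
    """Number of time windows that must exist at once (ceiling division)."""
    return -(-retention_days // window_days)


def estimate():
    windows = live_windows(MAX_RETENTION_DAYS, WINDOW_DAYS)
    records_per_window = RECORDS_PER_DAY * WINDOW_DAYS
    return {
        "windows": windows,
        # Option 1: one index per user per window (worst case, all users keep a year).
        "per_user_indexes": USERS * windows,
        "avg_docs_per_user_index": records_per_window // USERS,
        # Option 2: one index per retention tier per window.
        "per_tier_indexes": RETENTION_TIERS * windows,
        "avg_docs_per_tier_index": records_per_window // RETENTION_TIERS,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the per-user layout ends up with over a hundred thousand indexes of only a few thousand documents each on average, which is the "lots of small indexes" concern, whereas grouping by retention period keeps the index count in the tens.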