I have a use case with small-ish units of data (generally <1GB) whose lifecycles I want to manage independently. Essentially, each user has information that we want to keep in a hot data tier during a session to make it quickly searchable; it still needs to be searchable at other times, but can then be stored in lower tiers to save on costs. This data may be modified at any time.
The total size of data for all users is on the order of terabytes.
- Is it reasonable to store each user as its own index? This seems to be inadvisable, but I'm not sure of another way to manage the lifecycles independently. Could data from different users be rolled into a single index based on last-session time?
- How would rehydration of data into the hot tier work if the user's data is already indexed in a lower tier? How would this work if an index contains multiple users and we only want to re-index a single user's data?
One option I see for myself is to use the reindex API to copy hot data down into lower-tier indices (or cooler data back up into hot indices) and then delete the data from the original index. However, this requires storing a lot of state outside of the Elasticsearch instance.
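To illustrate, the per-user move I have in mind would look roughly like this (index names and the `user_id` field are made up for illustration):

```
# Copy one user's documents from a cooler index into a hot one
POST _reindex
{
  "source": {
    "index": "users-cold",
    "query": { "term": { "user_id": "u123" } }
  },
  "dest": { "index": "users-hot" }
}

# Then remove that user's documents from the source index
POST users-cold/_delete_by_query
{
  "query": { "term": { "user_id": "u123" } }
}
```

The external state I'd need to track is which index currently holds each user's data, so queries and the next move target the right place.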
Are ingest processors able/allowed to have side effects that could manage the deletion of that data for me?
Why do you need different tiers? It doesn't sound like your use case matches the assumptions that generally apply around tiered architectures.
I would recommend having a single tier with reasonably dense nodes. This means that all data is always available and there is no need for a complex process of moving data between tiers.

As all indices are in the same tier, multiple users can also share indices, which brings down the index and shard count. As a user's data volume is reasonably low, you can simply use delete-by-query to remove it when necessary. This is not as efficient as dropping a complete index, but that cost is likely outweighed by the benefits of a simpler architecture and a reduced shard count.
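For example, assuming documents are tagged with a `user_id` field (a hypothetical field name here), removing one user from a shared index is a single call:

```
POST shared-user-data/_delete_by_query
{
  "query": { "term": { "user_id": "u123" } }
}
```

Note that delete-by-query leaves deleted documents to be cleaned up by background segment merges rather than reclaiming space immediately, which is part of why it is less efficient than dropping a whole index; for data volumes this small per user, that overhead is usually acceptable.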