I have a use case with small-ish units of data (generally <1GB) whose lifecycles I want to manage independently. Essentially, each user has information that we want to keep in a hot data tier during a session to make it quickly searchable; it still needs to be searchable at other times, but can then be stored in lower tiers to save on costs. This data may be modified at any time.
The total size of data for all users is on the order of terabytes.
- Is it reasonable to store each user as its own index? This seems to be inadvisable, but I'm not sure of another way to manage the lifecycles independently. Could data from different users be rolled into a single index based on last-session time?
- How would rehydration of data into the hot tier work if the user's data is already indexed in a lower tier? How would this work if an index contains multiple users and we only want to re-index a single user's data?
One option I see for myself is to use the reindex API to copy hot data down into lower-tier indices (or cooler data back up into hot indices) and then delete the data from the original index. However, this requires storing a lot of state outside of the Elasticsearch instance.
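To illustrate, the per-user move I have in mind would look roughly like this (index names and the `user_id` field are made up for illustration):

```
# Copy one user's documents from a cooler index into a hot one
POST _reindex
{
  "source": {
    "index": "users-cold",
    "query": { "term": { "user_id": "u123" } }
  },
  "dest": { "index": "users-hot" }
}

# Then remove that user's documents from the source index
POST users-cold/_delete_by_query
{
  "query": { "term": { "user_id": "u123" } }
}
```

The external state I'd need to track is which index currently holds each user's data, so queries and the next move target the right place.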
Are ingest processors able/allowed to have side effects that could manage the deletion of that data for me?
Why do you need different tiers? It doesn't sound like your use case matches the assumptions that generally apply around tiered architectures.
I would recommend having a single tier with reasonably dense nodes. This means that all data is always available and there is no need for a complex process of moving data between tiers.

As all indices are in the same tier, multiple users can also share indices, which brings down the index and shard count. As a user's data volume is reasonably low, you can simply use delete-by-query to remove it when necessary. This is not as efficient as dropping a complete index, but that cost is likely outweighed by the benefits of a simpler architecture and a reduced shard count.
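For example, assuming documents are tagged with a `user_id` field (a hypothetical field name here), removing one user from a shared index is a single call:

```
POST shared-user-data/_delete_by_query
{
  "query": { "term": { "user_id": "u123" } }
}
```

Note that delete-by-query leaves deleted documents to be cleaned up by background segment merges rather than reclaiming space immediately, which is part of why it is less efficient than dropping a whole index; for data volumes this small per user, that overhead is usually acceptable.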