Hello everyone,
I have a pipeline that stores two types of entities in Elasticsearch, giving my users fast filtered searches and aggregations.
- Entity Type 1: Needs to be stored indefinitely.
- Entity Type 2: Can be deleted after one year.
The data volume flowing through the pipeline is highly variable. Some days it’s only 50-100 documents, while other days it can reach millions of documents.
- In the first 24 hours after a document is inserted, it receives frequent updates.
- After 24 hours, updates are less frequent but can still be large-scale (1-5 million documents in extreme cases).
Currently, I’m creating a new index daily for each entity type (2 indices per day, 4 shards total per day). As a result, my cluster has accumulated a large number of very small indices and a high shard count, which doesn’t seem efficient.
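For context, the daily index creation looks roughly like this (the index names and replica settings here are illustrative assumptions, not my exact config):

```shell
# One index per entity type per day, 2 primary shards each
# (4 shards total per day across both entity types)
curl -X PUT "localhost:9200/entity1-2024.06.01" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'

curl -X PUT "localhost:9200/entity2-2024.06.01" \
  -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
```

On low-volume days these indices end up holding only a few dozen documents each, which is where the small-index problem comes from.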
I’ve been considering switching to data streams, but I’ve read that they are intended for append-only (immutable) data, and given my update patterns I’m unsure whether this approach fits my use case.
I’m looking for advice on how to optimize my index management. Thank you!