Rollover index duplicating data, data coming from Logstash

As Stephen describes, rollover was designed to help solve a common but quite specific problem: creating indices and shards of reasonably uniform size when data volumes fluctuate over time and the data is immutable.

It is now more or less the standard and is often recommended on this forum, even for scenarios and use cases where it is not really suitable, or at least not a natural fit. A long time ago I wrote a blog post about duplicate prevention, and in it I discuss the problems with using rollover if you need to perform duplicate prevention the way you describe. Even though it is old, I think it is still largely applicable.

Before rollover was available in Elasticsearch, the standard way of handling time-series data was to create indices with the time period they cover reflected in the name. This could be daily (e.g. logstash-2023.05.07), weekly (e.g. logstash-2023.14) or monthly (e.g. logstash-2023.05). The time period each index covers is set up front in the Logstash configuration, as this determines the index name to write to. The number of primary shards would be adjusted periodically through an index template, based on projected data volumes, so that the expected shard size would not grow too large. With this approach each piece of data, if associated with a specific timestamp, is always sent to one specific index, which allows for deduplication.
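A minimal Logstash pipeline sketch of this pattern, combining a date-based index name with a fingerprint-based document ID so that re-ingested duplicates overwrite the same document instead of creating new ones (the hosts value and the choice of the message field to hash are assumptions; hash whatever fields uniquely identify your events):

```
filter {
  fingerprint {
    # Hash the fields that uniquely identify an event. "message" is an
    # assumption for this sketch - adjust to your data.
    source => ["message"]
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    # Date-based index name; daily granularity in this example
    index => "logstash-%{+YYYY.MM.dd}"
    # Using the fingerprint as the document _id makes duplicate events
    # update the existing document rather than be indexed twice
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```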

This approach naturally has the drawback that the size of indices and shards can vary significantly, which is generally undesirable.

I believe it should be possible to use ILM with indices created according to the scheme described above, as the use of rollover is optional. As far as I know ILM will however delete indices based on when they were created, so it will not automatically take the timestamp component of the index name into account, which may or may not be an issue.
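As an illustration, a delete-only ILM policy without a rollover action, applied through an index template, might look like the sketch below (the policy name, template name and 30-day retention are made-up values; note that min_age here is measured from index creation time, which is exactly the caveat above):

```
PUT _ilm/policy/logstash-delete-only
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

PUT _index_template/logstash-template
{
  "index_patterns": ["logstash-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logstash-delete-only"
    }
  }
}
```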

As an alternative to ILM there is the old and trusted Curator. It runs externally to the cluster (e.g. via cron) and has more features than ILM, which offers more flexibility.
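One relevant example of that flexibility: Curator can delete indices based on the date encoded in the index name rather than the creation date. A sketch of such an action file (the logstash- prefix, date pattern and 30-day retention are assumptions):

```
actions:
  1:
    action: delete_indices
    description: >-
      Delete logstash- indices older than 30 days, judged by the date
      in the index name rather than the index creation date
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
```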

The best way to handle this would however IMHO be to make sure you avoid duplicates when you extract your data in the first place. This may mean you need to enhance your Perl scripts or possibly switch to some other extraction mechanism. It is generally a good idea to stick with the standard recommended approach (rollover and data streams) if possible. If that is not an option I would recommend looking into the options outlined above.
