Currently we have a logging pipeline based on the ELK stack (plus Filebeat). For quite some time we've been sizing indices with a fixed number of shards, usually aiming for roughly 40GB per shard. We're now looking into automating this with size-based rollover through Index Lifecycle Management (ILM) policies.
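For context, the kind of size-based policy we're considering looks roughly like this (the policy name and thresholds are placeholders, not our real values):

```json
PUT _ilm/policy/logs-rollover-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "40gb"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```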
In our current setup we set custom document IDs based on some field(s) that we extract from the events. The reasoning behind this was to be able to "replay" the data ingestion (from Kafka, in our case) and re-ingest some data or fill any gaps if we had an outage: re-indexing a document with the same ID simply overwrites the old version.
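Concretely, the ID derivation is just a deterministic hash over a few event fields, something like this (the field names here are illustrative, not our actual schema):

```python
import hashlib


def document_id(event: dict, fields=("@timestamp", "host", "message")) -> str:
    """Build a deterministic _id from selected event fields, so that
    re-ingesting the same event always produces the same _id."""
    key = "|".join(str(event.get(f, "")) for f in fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


event = {"@timestamp": "2024-01-01T00:00:00Z", "host": "web-1", "message": "GET /"}
# Replaying the same event yields the same _id, so the document gets
# overwritten rather than duplicated -- as long as it lands in the same index.
replayed = dict(event)
assert document_id(event) == document_id(replayed)
```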
If we move to size-based rollover we lose the ability to replay traffic: even if we keep setting the document ID, a replayed event could (potentially) be ingested into a different index than the original, which means the old version of the document will not be overwritten, resulting in duplicates. We thought about using the routing parameter, but that would result in the same situation, since both the document ID and the routing parameter are resolved against the underlying index behind the write alias (as far as I can tell).
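To make the failure mode concrete, here's a toy simulation (plain Python, no Elasticsearch involved; index names are made up) of why per-index _id uniqueness breaks deduplication across rollover:

```python
# Toy model: each index is a dict keyed by _id; the write alias points
# at whichever index is currently "hot".
indices = {"logs-000001": {}, "logs-000002": {}}


def index_doc(write_index: str, doc_id: str, doc: dict) -> None:
    # _id uniqueness is scoped to a single index, as in Elasticsearch.
    indices[write_index][doc_id] = doc


# Original ingestion: the write alias points at logs-000001.
index_doc("logs-000001", "abc123", {"message": "GET /"})

# Rollover happens; the alias now points at logs-000002.
# Replaying the same event writes the same _id into the new index...
index_doc("logs-000002", "abc123", {"message": "GET /"})

# ...so a search across both indices now returns two copies of the event.
copies = sum(doc_id == "abc123" for idx in indices.values() for doc_id in idx)
assert copies == 2  # duplicated, nothing was overwritten
```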
In general: I'm wondering whether there is a recommended way of applying an Index Lifecycle Policy (size-based in particular) to indices that use custom document IDs, such that we avoid duplicated events when we need to re-ingest some old data.
If anyone can share any tips from a similar setup, it would also be greatly appreciated.