Index Lifecycle Management with document ids / routing

Currently, we have a logging pipeline based on ELK (plus Filebeat). For quite some time we've been sharding with a fixed number of shards, usually aiming for ~40GB per shard. We're now looking into automating this by using size-based rollover through Index Lifecycle Policies.
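For context, a minimal sketch of the kind of size-based policy we have in mind (the policy structure follows the ILM rollover action; the 40gb threshold mirrors our manual target, and any surrounding names would be our own):

```python
import json

# Sketch of a size-based ILM policy body. "max_primary_shard_size" asks ILM
# to roll the write index over once any primary shard reaches the given size,
# roughly matching our manual ~40GB-per-shard target.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "40gb"}
                }
            }
        }
    }
}

print(json.dumps(policy, indent=2))
```

This body would be PUT to the `_ilm/policy` endpoint and attached to the index template behind the write alias.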

In our current setup, we set custom document ids based on some field(s) that we extract from the events. The reasoning behind this was to be able to "replay" the data ingestion (from Kafka in our case) and re-ingest some data or fill any gaps if we had an outage.
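Concretely, the id derivation looks something like this (a sketch: the field names and hashing choice here are illustrative, not our exact implementation). The point is that replaying the same Kafka record produces the same `_id`, so a re-ingest overwrites rather than duplicates:

```python
import hashlib

def document_id(event, id_fields=("host", "timestamp", "offset")):
    """Derive a deterministic document id from selected event fields,
    so replaying the same record maps to the same _id."""
    key = "|".join(str(event.get(f, "")) for f in id_fields)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

event = {"host": "web-1", "timestamp": "2021-06-01T12:00:00Z", "offset": 42}
# The same event always yields the same id, regardless of when it is replayed.
assert document_id(event) == document_id(dict(event))
```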

If we move to size-based sharding we lose the ability to replay traffic: even if we continue to set the document id, the replayed event could (potentially) be ingested into a different index, which means the old version of the document will not be overwritten, resulting in duplicated results. We thought about using the routing parameter, but that would lead to the same situation, since both the document id and the routing parameter only apply within the underlying index behind the write alias (as far as I can tell).
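A toy model of the problem (an assumption-laden sketch, not real Elasticsearch behavior verbatim): `_id` is only unique within a single index, so if rollover has happened between the original ingestion and the replay, the same id lands in two backing indices and a search through the alias returns both:

```python
# Two backing indices behind a write alias, after one rollover.
indices = {"logs-000001": {}, "logs-000002": {}}

def index_doc(index, doc_id, doc):
    # An id-based overwrite only happens within the same index.
    indices[index][doc_id] = doc

index_doc("logs-000001", "evt-1", {"msg": "original"})   # first ingestion
index_doc("logs-000002", "evt-1", {"msg": "replayed"})   # replay after rollover

# Searching via the alias scans all backing indices:
hits = [doc for idx in indices.values() for did, doc in idx.items() if did == "evt-1"]
assert len(hits) == 2  # duplicated, not overwritten
```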

In general: I'm wondering if there is any recommended way of applying an Index Lifecycle Policy (size-based in particular) to indices that have custom document ids, such that we avoid duplicated events if we need to re-ingest some old data.

If anyone can share any tips on a similar setup it would also be greatly appreciated :wink:

I do not think there is any way to fully eliminate duplicates through IDs when using ILM/rollover, although the longer the duration an index covers relative to the delay in arrival of duplicate events, the lower the probability.

Thanks for the reply @Christian_Dahlqvist!

That was my finding as well; I wanted to check that I wasn't missing something. The issue is that we will not control the duration of an index (we're very much after the size-based approach), and we have a few indices that will be rolled over several times per day. On a normal day we will not have issues; it will only be a problem if we need to re-ingest some data.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.