Using fingerprint and document_id for at-least-once delivery and dedupe with ILM

tomr · June 10, 2021, 6:55am

TL;DR

Is ILM compatible with at-least-once-delivery / deduplication / idempotence?

Longer version:

I routinely use a fingerprint filter combined with the elasticsearch output's document_id setting for at-least-once delivery.

Using named indices (with date math), this had some nice properties, especially around the ability to replay logs in the event of partial ingest or changed processing logic, without worrying about duplicates ending up in my indices. There was always a maximum of one document for any given _id, and the same source data always ended up in the same ES index.
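To make the pattern concrete, here is a minimal Python sketch of the idea rather than the actual Logstash configuration. It assumes the fingerprint is a SHA-256 hash of the message and that the index name is derived from the event's own timestamp. The names `fingerprint`, `index_for`, and `ingest`, and the in-memory `store` dict standing in for Elasticsearch, are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(message: str) -> str:
    """Deterministic document ID: the same content always hashes to the same ID."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def index_for(timestamp: datetime, prefix: str = "logs") -> str:
    """Date-based index name taken from the event timestamp (assumed naming scheme)."""
    return f"{prefix}-{timestamp:%Y.%m.%d}"


def ingest(store: dict, events: list) -> None:
    """Index each event under its fingerprint; re-indexing overwrites, never duplicates."""
    for timestamp, message in events:
        index = store.setdefault(index_for(timestamp), {})
        index[fingerprint(message)] = {
            "@timestamp": timestamp.isoformat(),
            "message": message,
        }


events = [
    (datetime(2021, 6, 9, 23, 50, tzinfo=timezone.utc), "user login alice"),
    (datetime(2021, 6, 9, 23, 55, tzinfo=timezone.utc), "user logout alice"),
    (datetime(2021, 6, 10, 0, 5, tzinfo=timezone.utc), "user login bob"),
]

store = {}            # hypothetical stand-in for the cluster: index name -> {_id: doc}
ingest(store, events)
ingest(store, events)  # replay the same logs

total = sum(len(docs) for docs in store.values())
print(sorted(store), total)
```

Because both the ID and the index name depend only on the event itself, the replay overwrites the three existing documents instead of adding new ones.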

But now that I'm moving to ILM, this no longer works: writes go to whichever index is current at the time, so I can't rely on a replayed document going into the same index as the original.

Is this just a known limitation that I have to accept and deal with? Or is there a way to make the elasticsearch output quasi-idempotent even with ILM enabled?

warkolm · June 10, 2021, 7:05am

ILM is built on the idea that indices are more or less read-only once they have been rolled over.

If you want to update the old data, you need to talk to the index it lives in rather than the ILM alias, and doing that is outside the scope of what ILM is designed to do.
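The difference can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Elasticsearch client code: `RolloverAlias` and its methods are hypothetical, and the `logs-000001` style of backing index names is assumed for illustration. Writing through the alias after a rollover leaves two copies of the same `_id`, while looking up the index that already holds the document and writing there keeps a single copy.

```python
class RolloverAlias:
    """Hypothetical model of a write alias over a series of rolled-over indices."""

    def __init__(self, name: str):
        self.name = name
        self.generation = 1
        self.indices = {self._current_name(): {}}  # index name -> {_id: doc}

    def _current_name(self) -> str:
        return f"{self.name}-{self.generation:06d}"

    def rollover(self) -> None:
        """Start a new backing index; the alias now writes there."""
        self.generation += 1
        self.indices[self._current_name()] = {}

    def index_via_alias(self, doc_id: str, doc: dict) -> None:
        """Write through the alias: always lands in the current write index."""
        self.indices[self._current_name()][doc_id] = doc

    def find_index(self, doc_id: str):
        """Return the backing index that already holds doc_id, or None."""
        for name, docs in self.indices.items():
            if doc_id in docs:
                return name
        return None

    def index_idempotent(self, doc_id: str, doc: dict) -> None:
        """Write to the index the document lives in, else to the current one."""
        target = self.find_index(doc_id) or self._current_name()
        self.indices[target][doc_id] = doc

    def count(self, doc_id: str) -> int:
        """Number of backing indices containing doc_id."""
        return sum(doc_id in docs for docs in self.indices.values())


doc = {"message": "user login alice"}

naive = RolloverAlias("logs")
naive.index_via_alias("abc123", doc)
naive.rollover()
naive.index_via_alias("abc123", doc)   # replay after rollover
naive_copies = naive.count("abc123")   # 2: one copy per backing index

careful = RolloverAlias("logs")
careful.index_idempotent("abc123", doc)
careful.rollover()
careful.index_idempotent("abc123", doc)  # replay goes to the original index
careful_copies = careful.count("abc123")  # 1

print(naive_copies, careful_copies)
```

In a real cluster the lookup step would cost an extra request per document and would not be atomic with the write, and the older index may no longer accept writes at all, which is part of why this falls outside what ILM is designed to do.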

tomr · June 10, 2021, 7:10am

Thanks Mark.

Just so I'm 100% clear: there is no way to get the benefits of ILM and have at-least-once pipelines as discussed above, correct?

tomr · June 10, 2021, 7:21am

I see this has been discussed at some length on GitHub.

It's clear that, at this time, I can't have my cake and eat it too.
