Fingerprint processor allowing duplicates

I'm ingesting log files that have known duplicates, so I've implemented a Fingerprint processor in the ingest pipeline and set its hash as the _id to remove the duplicates, which works perfectly. However, when the index rolls over (via the ILM policy), the new backing index doesn't see the duplicate _id elsewhere in the data stream, so it allows the insert.
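
For reference, the pipeline is essentially a fingerprint processor over the natural-key fields followed by a set of _id. A rough sketch with the Python client (the pipeline name and field list here are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hash the natural-key fields, then copy the hash into _id so a re-ingested
# row collides with the document that is already indexed.
es.ingest.put_pipeline(
    id="dedupe-logs",  # hypothetical pipeline name
    processors=[
        {
            "fingerprint": {
                "fields": ["host.name", "event.created", "message"],  # illustrative natural key
                "target_field": "fingerprint",
                "method": "SHA-256",
            }
        },
        {
            "set": {
                "field": "_id",
                "value": "{{{fingerprint}}}",
            }
        },
    ],
)
```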

Is there a recommended approach to handle this duplication?

Unfortunately no, you would need to handle this outside Logstash, for example by running a query to find the duplicate documents and using delete_by_query to remove one of them.
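
If you go that route, something like the sketch below could work. It assumes the fingerprint hash was also kept in a regular keyword field (called fingerprint here, since you can't aggregate on _id), and the data stream name is a placeholder: a terms aggregation finds ids that occur more than once, and the extra copies are removed from their concrete backing indices.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Find fingerprints that appear in more than one document across the data stream.
resp = es.search(
    index="logs-sql-default",  # hypothetical data stream name
    size=0,
    aggs={
        "dupes": {
            "terms": {"field": "fingerprint", "min_doc_count": 2, "size": 1000},
            "aggs": {
                "docs": {
                    "top_hits": {
                        "size": 10,
                        "sort": [{"@timestamp": "asc"}],
                        "_source": False,
                    }
                }
            },
        }
    },
)

for bucket in resp["aggregations"]["dupes"]["buckets"]:
    hits = bucket["docs"]["hits"]["hits"]
    # Keep the oldest copy and delete the rest from their concrete backing
    # indices (deletes by _id must target the backing index, not the data stream).
    for hit in hits[1:]:
        es.delete_by_query(
            index=hit["_index"],
            query={"ids": {"values": [hit["_id"]]}},
        )
```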

I was afraid of that. So my only option is to find the duplicate _id records and delete one of them after it's indexed?

Is there maybe a different processor in the ingest pipeline that would identify this? Or maybe something like a unique constraint on the data stream, rather than the index?

There is not. The fingerprint processor is the tool for this, but it only works when the duplicate _id lands in the same index; if the backing index has changed, reusing the _id of a previous document is not seen as a duplicate.

What you described is an edge case of this processor, so it won't help in this case.

No, the unique constraint is the _id of the document, but the _id must be unique per index, not per data stream/alias.

The solution would be to query Elasticsearch during ingestion to check whether a document id already exists, but this can be really expensive, so I would not recommend it.
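
To illustrate why: a per-document existence check looks roughly like the sketch below (the data stream name and helper function are hypothetical), and it is the extra round-trip per document that makes it expensive.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_if_new(doc: dict, doc_id: str, data_stream: str = "logs-sql-default") -> None:
    """Skip documents whose _id already exists anywhere in the data stream.

    This costs one extra request per document, so it gets expensive quickly.
    """
    existing = es.count(index=data_stream, query={"ids": {"values": [doc_id]}})
    if existing["count"] == 0:
        # Data streams only accept op_type "create", which is what create() uses.
        es.create(index=data_stream, id=doc_id, document=doc)
```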

How many duplicates do you have? Is this really an issue?

Yes, it is really an issue, as it's skewing metrics, and duplicates will continue to happen each time the index rolls over.

It's a timing issue with the source data in SQL. The write interval is in some cases longer than the query interval. The SQL query intentionally returns the same records multiple times (with some additional records each time), to ensure it eventually captures a superset of the data. It's not ideal, but they're trying to take the weight of the query off the SQL database, so I've been relying on a "natural key" in the dataset to fingerprint in Elastic.

That use case is really not a good fit for data streams, which are intended to be append-only storage.

You might do better with dated indices, where Logstash will route the duplicates to the same index based on their date, and duplicate ids will result in documents being overwritten.
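
In Logstash that would just be an elasticsearch output with a date-based index pattern (e.g. index => "logs-sql-%{+YYYY.MM.dd}") and document_id => "%{fingerprint}". The same idea sketched with the Python client instead (the index pattern and names are illustrative):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_event(doc: dict, fingerprint: str, event_time: datetime) -> None:
    """Route by event date to a plain daily index, keyed by the fingerprint."""
    index = f"logs-sql-{event_time.astimezone(timezone.utc):%Y.%m.%d}"  # hypothetical pattern
    es.index(index=index, id=fingerprint, document=doc)
```

Because these are regular indices rather than a data stream, a re-ingested row produces the same fingerprint and lands in the same daily index, so the second write is just an update and the duplicate disappears without any cleanup job.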