Fingerprint processor allowing duplicates

I'm ingesting log files that have known duplicates, so I've implemented a Fingerprint processor in the ingest pipeline and set its hash as the _id to remove the duplicates, which works perfectly. However, when the index rolls over (via the ILM policy), the new backing index doesn't see the duplicate _id elsewhere in the data stream, so it allows the insert.
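
For reference, the pipeline is essentially a fingerprint processor over the natural-key fields followed by a set of _id. A rough sketch with the Python client (the pipeline name and field list here are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hash the natural-key fields, then copy the hash into _id so a re-ingested
# row collides with the document that is already indexed.
es.ingest.put_pipeline(
    id="dedupe-logs",  # hypothetical pipeline name
    processors=[
        {
            "fingerprint": {
                "fields": ["host.name", "event.created", "message"],  # illustrative natural key
                "target_field": "fingerprint",
                "method": "SHA-256",
            }
        },
        {
            "set": {
                "field": "_id",
                "value": "{{{fingerprint}}}",
            }
        },
    ],
)
```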

Is there a recommended approach to handle this duplication?

Unfortunately no, you would need to handle this outside Logstash, for example by running a query to find the duplicate documents and using delete_by_query to remove one of them.
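
If you go that route, something like the sketch below could work. It assumes the fingerprint hash was also kept in a regular keyword field (called fingerprint here, since you can't aggregate on _id), and the data stream name is a placeholder: a terms aggregation finds ids that occur more than once, and the extra copies are removed from their concrete backing indices.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Find fingerprints that appear in more than one document across the data stream.
resp = es.search(
    index="logs-sql-default",  # hypothetical data stream name
    size=0,
    aggs={
        "dupes": {
            "terms": {"field": "fingerprint", "min_doc_count": 2, "size": 1000},
            "aggs": {
                "docs": {
                    "top_hits": {
                        "size": 10,
                        "sort": [{"@timestamp": "asc"}],
                        "_source": False,
                    }
                }
            },
        }
    },
)

for bucket in resp["aggregations"]["dupes"]["buckets"]:
    hits = bucket["docs"]["hits"]["hits"]
    # Keep the oldest copy and delete the rest from their concrete backing
    # indices (deletes by _id must target the backing index, not the data stream).
    for hit in hits[1:]:
        es.delete_by_query(
            index=hit["_index"],
            query={"ids": {"values": [hit["_id"]]}},
        )
```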

I was afraid of that. So my only option is to find the duplicate _id records and delete one of them after it's indexed?

Is there maybe a different processor in the ingest pipeline that would identify this? Or maybe something like a unique constraint on the data stream, rather than the index?

There is not. The fingerprint processor is the tool for this, but it only works when the duplicate _id lands in the same index; if the backing index has changed, reusing the _id of a previous document is not seen as a duplicate.

What you described is an edge case of this processor, so it won't help in this case.

No, the unique constraint is the _id of the document, but the _id must be unique per index, not per data stream/alias.

The solution would be to query Elasticsearch during ingestion to check whether a document id already exists, but this can be really expensive, so I would not recommend it.
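
To illustrate why: a per-document existence check looks roughly like the sketch below (the data stream name and helper function are hypothetical), and it is the extra round-trip per document that makes it expensive.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_if_new(doc: dict, doc_id: str, data_stream: str = "logs-sql-default") -> None:
    """Skip documents whose _id already exists anywhere in the data stream.

    This costs one extra request per document, so it gets expensive quickly.
    """
    existing = es.count(index=data_stream, query={"ids": {"values": [doc_id]}})
    if existing["count"] == 0:
        # Data streams only accept op_type "create", which is what create() uses.
        es.create(index=data_stream, id=doc_id, document=doc)
```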

How many duplicates do you have? Is this really an issue?

Yes, it is really an issue, as it's skewing metrics, and duplicates will continue to happen each time the index rolls over.

It's a timing issue with the source data in SQL. The write interval is in some cases longer than the query interval. The SQL query intentionally returns the same records multiple times (with some additional records each time), to ensure it eventually captures a superset of the data. It's not ideal, but they're trying to take the weight of the query off the SQL database, so I've been relying on a "natural key" in the dataset to fingerprint in Elastic.

That use case is really not a good fit for data streams, which are intended to be append-only storage.

You might do better with dated indices, where Logstash will route the duplicates to the same index based on their date, and duplicate ids will result in documents being overwritten.
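
In Logstash that would just be an elasticsearch output with a date-based index pattern (e.g. index => "logs-sql-%{+YYYY.MM.dd}") and document_id => "%{fingerprint}". The same idea sketched with the Python client instead (the index pattern and names are illustrative):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_event(doc: dict, fingerprint: str, event_time: datetime) -> None:
    """Route by event date to a plain daily index, keyed by the fingerprint."""
    index = f"logs-sql-{event_time.astimezone(timezone.utc):%Y.%m.%d}"  # hypothetical pattern
    es.index(index=index, id=fingerprint, document=doc)
```

Because these are regular indices rather than a data stream, a re-ingested row produces the same fingerprint and lands in the same daily index, so the second write is just an update and the duplicate disappears without any cleanup job.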