How to handle duplicate records in data streams using fingerprint

Hi Team,

I am ingesting data from a Kafka topic into Elasticsearch using Logstash.
The incoming data can contain duplicates, so I am using a fingerprint filter on a unique business field (seqId) and setting it as the document _id.

This works correctly within a single backing index — duplicates are not created as long as the data goes into the same index.

However, once the data stream rolls over to a new backing index, I start seeing duplicate documents again, even though the _id generated from the fingerprint remains the same.

Setup details:

  • Ingesting data using Logstash → Elasticsearch data stream (combined pipeline sketch at the end of this list)

  • Using fingerprint filter:

    fingerprint {
      # hash seqId into a metadata field that is used below as the document _id;
      # @metadata fields are not indexed with the event
      source => ["seqId"]
      target => "[@metadata][generated_id]"
    }
    
    
  • Using this in the output:

    document_id => "%{[@metadata][generated_id]}"
    
    
  • Data stream rollover is based on time

  • Same seqId can arrive again after rollover
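
Putting the details above together, a minimal sketch of the pipeline; the broker address, topic name (events), and hosts are placeholders for my actual values:

    input {
      kafka {
        bootstrap_servers => "localhost:9092"   # placeholder broker
        topics            => ["events"]         # placeholder topic
        codec             => "json"
      }
    }

    filter {
      fingerprint {
        source => ["seqId"]
        target => "[@metadata][generated_id]"
      }
    }

    output {
      elasticsearch {
        hosts       => ["https://localhost:9200"]   # placeholder cluster
        data_stream => "true"
        document_id => "%{[@metadata][generated_id]}"
      }
    }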

Thanks in advance.

Hello @venkatkumar229

Data streams are designed for append-only time-series data, so they are not a good fit when you need global de-duplication based on _id.
_id uniqueness is enforced only within a single backing index, so after a rollover the same _id can be indexed again into the new backing index, which is exactly the behavior you are seeing.
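
You can see this directly by searching the data stream for one of the generated ids; after a rollover the same _id shows up once per backing index. A quick sketch, with the data stream name and the fingerprint value as placeholders:

    GET logs-myapp-default/_search
    {
      "query": {
        "ids": { "values": ["<fingerprint-of-seqId>"] }
      }
    }

Each hit will report a different .ds-* backing index in its _index field.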

The same thing happens with regular indices behind a rollover write alias, because Elasticsearch does not check older indices for an existing _id.

If you require exactly one document per business key (for example seqId), the recommended approach is to use a Transform to maintain a de-duplicated destination index, or you would have to avoid index rollover entirely (which is usually not feasible for time-series data).
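
A minimal sketch of such a "latest" transform, assuming seqId is mapped as a keyword and using placeholder source/destination names; it keeps only the most recent document per seqId in the destination index:

    PUT _transform/dedup-seqid
    {
      "source": { "index": "logs-myapp-default" },
      "dest":   { "index": "seqid-latest" },
      "latest": {
        "unique_key": ["seqId"],
        "sort": "@timestamp"
      },
      "sync": {
        "time": { "field": "@timestamp", "delay": "60s" }
      },
      "frequency": "1m"
    }

    POST _transform/dedup-seqid/_start

Queries that need one document per seqId then go against the destination index, while the data stream keeps the raw append-only events.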

Similar post:

Thanks!!


As Tortoise said, data streams are append-only: roughly write-once, read-many-times indices.

Check a similar topic, the blog, and GitHub.
