I am ingesting data from a Kafka topic into Elasticsearch using Logstash.
The incoming data can contain duplicates, so I am using a fingerprint filter on a unique business field (seqId) and setting it as the document _id.
This works correctly within a single backing index — duplicates are not created as long as the data goes into the same index.
However, once the data stream rolls over to a new backing index, I start seeing duplicate documents again, even though the _id generated from the fingerprint remains the same.
Setup details:
Ingesting data using Logstash → Elasticsearch data stream
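For reference, the relevant part of the pipeline looks roughly like this (hosts and data stream settings are placeholders; the fingerprint is computed from seqId and used as the document _id, matching the setup described above):

```
filter {
  fingerprint {
    source => ["seqId"]                      # unique business field
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}

output {
  elasticsearch {
    hosts                 => ["https://localhost:9200"]   # placeholder
    data_stream           => true
    data_stream_type      => "logs"                        # placeholder values
    data_stream_dataset   => "myapp"
    data_stream_namespace => "default"
    document_id           => "%{[@metadata][fingerprint]}"
  }
}
```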
Data streams are designed for append-only time-series data, so they are not a good fit when you need global de-duplication based on _id.
_id uniqueness is enforced only within a single backing index, so after a rollover the same _id can be indexed again into the new backing index (which is exactly the behavior you are seeing).
The same thing happens with regular indices behind a rollover write alias, because Elasticsearch never checks older indices for an existing _id. You can reproduce it directly, as in the sketch below.
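A minimal way to see this in Dev Tools, assuming a data stream already exists (the name my-data-stream and the _id abc123 are placeholders):

```
# First write: accepted into the current backing index
PUT my-data-stream/_create/abc123
{
  "@timestamp": "2024-01-01T00:00:00Z",
  "seqId": "abc123"
}

# Force the data stream to roll over to a new backing index
POST my-data-stream/_rollover

# Second write with the same _id: also accepted, because the new
# backing index does not contain that _id yet
PUT my-data-stream/_create/abc123
{
  "@timestamp": "2024-01-02T00:00:00Z",
  "seqId": "abc123"
}

# Searching the data stream now returns two documents with the same _id,
# one per backing index
GET my-data-stream/_search
{
  "query": { "ids": { "values": ["abc123"] } }
}
```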
If you require exactly one document per business key (for example seqId), the recommended approach is to use a transform that maintains a separate de-duplicated index, or you will have to avoid index rollover altogether (which is usually not feasible). A minimal transform sketch follows.
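A sketch of a continuous latest transform for this, where the source pattern, destination index name, and transform id are assumptions based on the setup above (the @timestamp field comes from the data stream itself):

```
# Keep the most recent document per seqId in a separate, de-duplicated index
PUT _transform/seqid-dedupe
{
  "source": {
    "index": "logs-myapp-default"
  },
  "dest": {
    "index": "myapp-deduplicated"
  },
  "latest": {
    "unique_key": ["seqId"],
    "sort": "@timestamp"
  },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  }
}

# Start the transform so it runs continuously
POST _transform/seqid-dedupe/_start
```

Queries that must see exactly one document per seqId would then go against myapp-deduplicated, while the data stream keeps the raw, append-only history.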