Hi,
I'm looking at avoiding duplicate entries when indexing logs into Elasticsearch from Filebeat via Logstash. To do that I'll be using the Logstash fingerprint filter.
I'd like to use more than just the message field to ensure the generated IDs are unique, without relying on the @timestamp field sent by Filebeat. Given that, could I use offset alongside message to ensure unique ID generation?
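Roughly what I have in mind is sketched below (untested; the top-level offset field name assumes the Filebeat defaults, concatenate_sources assumes a reasonably recent fingerprint plugin, and the hosts/index/key values are placeholders):

```
filter {
  fingerprint {
    # hash the concatenation of both fields so each one contributes to the id
    source => ["message", "offset"]
    concatenate_sources => true
    method => "SHA256"
    # some plugin versions treat SHA methods as keyed HMACs, so set a static key
    key => "dedup"
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "filebeat-%{+YYYY.MM.dd}"
    # using the fingerprint as the document id means a re-sent event
    # overwrites the existing document instead of creating a duplicate
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```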
Offset + message + some datetime range should probably do it.
Without the datetime range there would be a chance of dropping valid logs: if you are harvesting a rotating log, offsets will repeat, and the same message might be re-sent at the same offset at a different time.
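A minimal sketch of that idea, assuming @timestamp is close enough to the event time to bucket by day (hypothetical; adjust the granularity to your rotation interval):

```
filter {
  # derive a coarse day bucket from the event time; it only has to
  # separate re-used offsets across log rotations
  ruby {
    code => "event.set('[@metadata][day]', event.get('@timestamp').time.strftime('%Y-%m-%d'))"
  }

  fingerprint {
    source => ["message", "offset", "[@metadata][day]"]
    concatenate_sources => true
    method => "SHA256"
    key => "dedup"
    target => "[@metadata][fingerprint]"
  }
}
```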
That said, AFAIK you shouldn't be getting duplicates from Filebeat unless there are problems ACKing from the output back to Filebeat, or in a few other edge-case scenarios. If you are receiving enough duplicates that you need to filter them, there is probably an underlying issue we should be looking into.
The problem with keying on the timestamp is that I can't be sure Filebeat isn't generating a timestamp for some portion of logs at transmission time. In that case, any scenario where Filebeat resends logs would produce a different fingerprint for the same event.
Regarding duplicate transmits from Filebeat: we have seen retransmits when the Logstash input flaps, as the pipeline to Elasticsearch blocks and becomes available again in rapid succession. It's an edge case I'm trying to account for.