It all works fine, except that occasionally Logstash will send a duplicate log to ES. The data in the duplicated log is all identical, except for the ID. I think this may have something to do with the ES load balancing.
Does anyone have any suggestions on how I can diagnose what's going on?
If Beats or Logstash encounter any problems shipping data downstream, they will retry automatically. This means that duplicates cannot be entirely avoided in the pipeline. If you instead define a document ID based on the content of the event, any attempt to write the same event twice results in an update rather than a new, duplicate document being created.
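As a concrete sketch of that approach, you can use the Logstash fingerprint filter to hash the event content and pass the result to Elasticsearch as the document ID. The `hosts` value and the `key` are placeholders you would adapt to your setup:

```conf
filter {
  fingerprint {
    # Hash the event content; use a field (or concatenated fields)
    # that uniquely identifies the event.
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA256"
    key    => "any-static-key"   # placeholder; used for the HMAC
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]   # placeholder
    # Reuse the content hash as the document ID, so a retried
    # event overwrites itself instead of creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

Because the ID is derived deterministically from the event content, a retry produces the same ID and turns into an idempotent index/update operation rather than a second document.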
Is there a recommended way of implementing this with Metricbeat? I'm not sure how I would uniquely identify the events without hashing the entire beat payload somehow.
A UUID will only help prevent duplicates created after it was assigned, or if it is taken from an external system. If you use Logstash to create the UUID, you could still end up with duplicates: if Beats is forced to retry before the UUID is assigned, the event is duplicated upstream and each copy gets a different UUID.