I'm using Filebeat to send data to a Logstash + Elasticsearch server. I noticed that if the Logstash + Elasticsearch server is not fast enough, I start to see duplicate entries in Elasticsearch. Is this a bug? What options do I have to remove these duplicates?
When there is back-pressure, Logstash will close the connection from Filebeat. For example, if the connection is closed in the middle of sending a batch of 10 events, Filebeat never receives an acknowledgement for that batch. To guarantee at-least-once delivery, Filebeat must then resend all 10 events, even though some of them may already have been received and processed the first time.
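For context, here is a minimal sketch of the Filebeat side of that pipeline. The option names assume a Filebeat 5.x-style configuration and the host and batch size are placeholders (the batch size just mirrors the 10-event example above), so check the reference docs for your version:

```
output.logstash:
  hosts: ["logstash.example.com:5044"]  # placeholder host
  # Number of events shipped per batch. A batch that is not acknowledged by
  # Logstash is resent in full, which is what produces the duplicates above.
  bulk_max_size: 10
```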
If your problem is caused by Logstash closing the connection, you will see indications of it in your Logstash logs (look for circuit-breaker messages). You can mitigate the problem by setting congestion_threshold on the beats input to a very high value (on the order of years) so that the circuit breaker effectively never trips.
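A minimal sketch of what that looks like in the Logstash pipeline config, assuming a version of the logstash-input-beats plugin that still supports the congestion_threshold option (it is specified in seconds; the port and exact value here are placeholders):

```
input {
  beats {
    port => 5044
    # congestion_threshold is in seconds; a huge value (~3 years here) means
    # the circuit breaker never trips and the connection is not force-closed.
    congestion_threshold => 99999999
  }
}
```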