I'm encountering an issue while using the azure-blob-storage input in Filebeat, where I'm consistently getting duplicate entries each time I poll for data.
Here's the setup: We have a container in Azure Blob Storage that contains numerous subfolders, each with multiple files.
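For reference, the input is configured roughly like the sketch below (the account, key, and container names are placeholders, and the intervals are simply the values we happen to use):

```yaml
filebeat.inputs:
  - type: azure-blob-storage
    id: azure-blob-input-1                            # placeholder id
    account_name: some_account                        # placeholder
    auth.shared_credentials.account_key: some_key     # placeholder
    containers:
      - name: my_container                            # placeholder; holds many subfolders/files
        max_workers: 3                                # several workers fetch blobs concurrently
        poll: true                                    # keep polling the container for new blobs
        poll_interval: 5m                             # how often the container is re-listed
```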
When I query the data in Elasticsearch using the 'filebeat_log_file_path' field, I notice that the Filebeat agent is fetching the content of the same file multiple times. For example, the timestamps in Filebeat logs show one entry for a file at 2023-09-28 07:37:00.470, and then another entry for the same file just a few milliseconds later, at 2023-09-28 07:37:00.530.
I'm curious to know if anyone else has experienced a similar issue.
Hi @djesus, this duplication of files is most likely due to the agent/beat crashing and restarting without saving its state. Versions prior to 8.10 have a concurrency issue that causes the input to crash when used with max_workers > 1. Updating to 8.10 should fix this.
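If you can't upgrade right away, one stopgap (my suggestion, not something from the docs) would be to pin the input to a single worker so the concurrent code path is never exercised, roughly like this (placeholder names again):

```yaml
filebeat.inputs:
  - type: azure-blob-storage
    account_name: some_account                        # placeholder
    auth.shared_credentials.account_key: some_key     # placeholder
    containers:
      - name: my_container                            # placeholder
        max_workers: 1                                # single worker sidesteps the pre-8.10 concurrency crash
        poll: true
        poll_interval: 5m
```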
I've just updated the agent, but regrettably, I'm still encountering the same issue. Interestingly, this behavior isn't consistent across all polled files. For instance, out of 10,000 files being polled, approximately 300 of them end up getting duplicated.
Is there any other aspect I should be investigating or examining?
Hi @djesus, could you please confirm how the data is being shipped from Filebeat further downstream? If it is shipped via Logstash to Elasticsearch, then the knowledge article (KA) below should help you.
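In case it helps while you check that: the general idea behind this kind of Logstash-based deduplication is to derive a deterministic document _id from each event (commonly with the fingerprint filter), so a re-shipped event overwrites the existing document instead of creating a duplicate. A rough sketch, with placeholder hosts and index:

```
filter {
  fingerprint {
    source => "message"                      # field(s) the hash is derived from
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]    # placeholder
    index       => "filebeat-%{+yyyy.MM.dd}"     # placeholder
    document_id => "%{[@metadata][fingerprint]}" # same event hashes to the same _id
  }
}
```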
Thanks for your reply! Filebeat is sending data to Graylog, which in turn stores it in Elasticsearch. I suppose I could try setting up a Logstash instance as an intermediate step, but that would add complexity to the solution and introduce another possible point of failure.