Duplicate content when using azure-blob-storage input

Hello everyone,

I'm encountering an issue while using the azure-blob-storage input in Filebeat, where I'm consistently getting duplicate entries each time I poll for data.

Here's the setup: We have a container in Azure Blob Storage that contains numerous subfolders, each with multiple files.

When I query the data in Elasticsearch using the 'filebeat_log_file_path' field, I notice that the Filebeat agent is fetching the content of the same file multiple times. For example, the timestamps in the Filebeat logs show one entry for a file at 2023-09-28 07:37:00.470, and then another entry for the same file only about 60 milliseconds later, at 2023-09-28 07:37:00.530.

I'm curious to know if anyone else has experienced a similar issue.

My environment details:

  • Filebeat version: 8.9.2 (amd64) libbeat version: 8.9.2 [d355dd57fb3accc7a2ae8113c07acb20e5b1d42a built 2023-08-30 19:39:56 +0000 UTC]
  • Operating system: RHEL 7.9

I appreciate any insights or help you can provide.

Thanks in advance!
Daniel

Hi @djesus,

This duplication of files is most likely due to the agent/beat crashing and restarting without doing a state save. Versions prior to 8.10 have a concurrency issue that causes the input to crash when it is used with max_workers > 1. Updating to 8.10 should fix this.
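Until you are on 8.10 you could also keep the input to a single worker so it stays off the concurrent code path. A rough sketch of what that input config could look like (account, key and container names below are placeholders, not your actual setup):

```yaml
filebeat.inputs:
  - type: azure-blob-storage
    account_name: <storage-account>                      # placeholder
    auth.shared_credentials.account_key: <account-key>   # placeholder
    max_workers: 1       # single worker avoids the pre-8.10 concurrency crash
    poll: true
    poll_interval: 5m    # example interval, tune to your needs
    containers:
      - name: <container-name>                           # placeholder
```

If I remember right, max_workers, poll and poll_interval can also be set per container if you only want to restrict one of them.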

Hello, @exdghost,

Again, thanks for your reply!

I've just updated the agent, but regrettably, I'm still encountering the same issue. Interestingly, this behavior isn't consistent across all polled files. For instance, out of 10,000 files being polled, approximately 300 of them end up getting duplicated.

Is there any other aspect I should be investigating or examining?

Filebeat version: 8.10.2 (amd64)
Libbeat version: 8.10.2
Built: 2023-09-18 18:09:06 +0000 UTC

Hi @djesus,

Could you please confirm how the data is getting shipped from Filebeat to the next stage? If it is being shipped via Logstash to Elasticsearch, then the KA below would help you.

Hi @suman.kumar

Thanks for your reply! Filebeat is sending data to Graylog, which in turn stores it in Elasticsearch. I guess I could try setting up a Logstash instance as a middle step; however, that would add complexity to the solution and introduce another possible point of failure.
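One thing I'm considering instead, to avoid adding Logstash just for deduplication, is computing a stable hash in Filebeat itself with the fingerprint processor, so that whatever sits downstream at least has a key to dedupe on. A rough sketch (the field names here are just a guess and would need adjusting to whatever uniquely identifies a line in our data):

```yaml
processors:
  - fingerprint:
      fields: ["log.file.path", "message"]   # assumed fields; adjust as needed
      target_field: "event.hash"             # hash lands here for downstream dedup
      method: "sha256"
      ignore_missing: true
```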

Also, I'm seeing a worse issue: when this happens (a 403 auth error from the azure-blob-storage Filebeat input, with the error "Request date header too old") and I have to restart Filebeat, it re-downloads everything from the blob storage again. It seems that the local Filebeat database keeping state of what was already downloaded gets corrupted.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.