I'm encountering an issue while using the azure-blob-storage input in Filebeat, where I'm consistently getting duplicate entries each time I poll for data.
Here's the setup: We have a container in Azure Blob Storage that contains numerous subfolders, each with multiple files.
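For reference, the input is configured roughly like the sketch below (the account, key, and container names are placeholders, and the intervals are simply the values we happen to use):

```yaml
filebeat.inputs:
  - type: azure-blob-storage
    id: azure-blob-input-1                            # placeholder id
    account_name: some_account                        # placeholder
    auth.shared_credentials.account_key: some_key     # placeholder
    containers:
      - name: my_container                            # placeholder; holds many subfolders/files
        max_workers: 3                                # several workers fetch blobs concurrently
        poll: true                                    # keep polling the container for new blobs
        poll_interval: 5m                             # how often the container is re-listed
```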
When I query the data in Elasticsearch using the 'filebeat_log_file_path' field, I notice that the Filebeat agent is fetching the content of the same file multiple times. For example, the timestamps in Filebeat logs show one entry for a file at 2023-09-28 07:37:00.470, and then another entry for the same file just a few milliseconds later, at 2023-09-28 07:37:00.530.
I'm curious to know if anyone else has experienced a similar issue.
Hi @djesus, this duplication of files is most likely due to the agent/beat crashing and restarting without saving its state. Versions prior to 8.10 have a concurrency issue that causes the input to crash when used with max_workers > 1. Updating to 8.10 should fix this.
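If you can't upgrade right away, one stopgap (my suggestion, not something from the docs) would be to pin the input to a single worker so the concurrent code path is never exercised, roughly like this (placeholder names again):

```yaml
filebeat.inputs:
  - type: azure-blob-storage
    account_name: some_account                        # placeholder
    auth.shared_credentials.account_key: some_key     # placeholder
    containers:
      - name: my_container                            # placeholder
        max_workers: 1                                # single worker sidesteps the pre-8.10 concurrency crash
        poll: true
        poll_interval: 5m
```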
I've just updated the agent, but regrettably, I'm still encountering the same issue. Interestingly, this behavior isn't consistent across all polled files. For instance, out of 10,000 files being polled, approximately 300 of them end up getting duplicated.
Is there any other aspect I should be investigating or examining?
Hi @djesus, could you please confirm how the data is being shipped from Filebeat further downstream? If it is shipped via Logstash to Elasticsearch, then the knowledge article (KA) below should help you.
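In case it helps while you check that: the general idea behind this kind of Logstash-based deduplication is to derive a deterministic document _id from each event (commonly with the fingerprint filter), so a re-shipped event overwrites the existing document instead of creating a duplicate. A rough sketch, with placeholder hosts and index:

```
filter {
  fingerprint {
    source => "message"                      # field(s) the hash is derived from
    target => "[@metadata][fingerprint]"
    method => "SHA256"
  }
}
output {
  elasticsearch {
    hosts       => ["https://localhost:9200"]    # placeholder
    index       => "filebeat-%{+yyyy.MM.dd}"     # placeholder
    document_id => "%{[@metadata][fingerprint]}" # same event hashes to the same _id
  }
}
```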
Thanks for your reply! Filebeat is sending data to Graylog, which in turn stores it in Elasticsearch. I suppose I could try setting up a Logstash instance as an intermediate step, but that would add complexity to the solution and introduce another possible point of failure.