A huge number of ListBlobs requests are being issued by filebeat to Azure Storage

[background]
I am building a data pipeline to visualize application logs using Elastic Stack.
Application log files are stored as JSON-formatted blobs in Azure Storage.
The architecture overview is as follows.
Application → Azure Blob Storage → filebeat → logstash → elasticsearch → kibana
This pipeline is working fine and the application logs are now visible on kibana.

[problem]
When checking the access log on Azure Storage, it appears that over 10,000 ListBlobs requests per minute are being made by filebeat.
The problem is that these List operations are driving up the storage transaction cost.

When checking the number of files in Storage, the polled container contained approximately 5,000 blobs.
Since the polling interval is 30s (two polls per minute), my guess is that filebeat issues one ListBlobs request per blob on each poll: 5,000 blobs × 2 polls per minute ≈ 10,000 requests per minute, which matches the observed rate.

[question]

  • Is my assumption above (that filebeat executes the ListBlobs command once per blob file) correct, i.e. is this filebeat's designed behavior?
  • Is there anything I can do to avoid executing a large number of ListBlobs commands?

[Additional information: my environment]
OS: almalinux 8.8 (Azure VM)
filebeat version: 8.13.4

[Already tried]
None of the following had any effect:
・Deleting the checkpoint file and restarting
・Reinstalling filebeat

[debug log]
The following three log lines are output repeatedly.

May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":141},"message":"scheduler: 0 jobs scheduled for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":154},"message":"scheduler: total objects read till now: 3895\nscheduler: total jobs scheduled till now: 0","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.797+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":107},"message":"scheduler: 1 blobs fetched for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}

[filebeat settings (partial excerpt)]

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  enabled: true
  tags: ["my_dev_log"]
  ignore_older: 1h
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    max_workers: 1
    poll: true
    poll_interval: 30s
    file_selectors:
    - regex: '/app/logs/'

Here is what I found through additional research:
- On every poll_interval, the ListBlobs command is executed (number of files in the container + 1) times.
- According to the log contents, the scheduleOnce function is executed the same number of times (see the note and sketch below).
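The debug log also says "1 blobs fetched for current batch", which matches my max_workers: 1 setting. If the list page size is tied to max_workers (this is only my assumption; I have not confirmed it in the filebeat source), then raising it might reduce the number of ListBlobs pages needed per poll. A minimal sketch of what I plan to try, using the same placeholder names as my config above:

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    # Assumption, not confirmed: if each ListBlobs page returns up to
    # max_workers blobs, a larger value means fewer list requests per poll.
    max_workers: 10
    poll: true
    poll_interval: 30s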

Can someone tell me whether this behavior is as designed? I am quite stuck.

Hello and welcome,

If I'm not wrong, yes, this is how it works: filebeat polls the storage container every poll interval and lists the blobs in it to determine which files it needs to download and process.

It is mentioned in the documentation that this can get expensive if you have a large number of files.

Polling is generally recommended for most cases, even though it can get expensive when dealing with a very large number of files.

I don't think so; this is how polling works when consuming data from blob storage. It is the same as polling S3 or GCS: with large buckets it can get expensive.
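That said, even if you cannot avoid the listing itself, you can reduce its rate: the number of ListBlobs calls scales with how often you poll, so lengthening poll_interval lowers the List cost proportionally if you can tolerate higher ingestion latency. A minimal sketch based on the config you shared:

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    max_workers: 1
    poll: true
    # 300s instead of 30s means one tenth the polling cycles, and
    # therefore roughly one tenth the ListBlobs requests per minute.
    poll_interval: 300s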

One alternative would be to change your data pipeline and send your logs to an Event Hub instead of storage containers.
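For reference, here is a minimal sketch of what the consuming side could look like with filebeat's azure-eventhub input. Every name and key below is a placeholder, and the storage account in this input is only used to persist checkpoints, not polled for log files:

filebeat.inputs:
- type: azure-eventhub
  # All values below are placeholders; replace them with your own.
  eventhub: "my-eventhub"
  consumer_group: "$Default"
  connection_string: "Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
  # Blob storage is still used here, but only for small checkpoint blobs.
  storage_account: "mystgacnt"
  storage_account_key: "dummy_key"
  storage_account_container: "filebeat-checkpoints"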