A huge number of ListBlobs requests are being issued by filebeat to Azure Storage

[background]
I am building a data pipeline to visualize application logs using Elastic Stack.
Application log files are stored as JSON-formatted blobs in Azure Storage.
The architecture overview is as follows.
Application → Azure Blob Storage → filebeat → logstash → elasticsearch → kibana
This pipeline is working fine and the application logs are now visible on kibana.

[problem]
When checking the access log on Azure Storage, it appears that over 10,000 ListBlobs requests per minute are being made by filebeat.
The problem is that these List operations are driving up the storage transaction cost.

When checking the number of files in Storage, the polled container contained approximately 5,000 blobs.
Since the polling interval is 30s (two polls per minute), my guess is that filebeat issues one ListBlobs request per blob on each poll: 5,000 blobs × 2 polls per minute ≈ 10,000 requests per minute, which matches the observed rate.

[question]

  • Is my assumption above (that filebeat executes the ListBlobs command once per blob file) correct, i.e. is this filebeat's designed behavior?
  • Is there anything I can do to avoid executing a large number of ListBlobs commands?

[Additional information: my environment]
OS: almalinux 8.8 (Azure VM)
filebeat version: 8.13.4

[Already tried]
None of the following had any effect:
・Deleting the checkpoint file and restarting
・Reinstalling filebeat

[debug log]
The following three log lines are output repeatedly.

May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":141},"message":"scheduler: 0 jobs scheduled for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":154},"message":"scheduler: total objects read till now: 3895\nscheduler: total jobs scheduled till now: 0","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.797+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":107},"message":"scheduler: 1 blobs fetched for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}

[filebeat settings (partial excerpt)]

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  enabled: true
  tags: ["my_dev_log"]
  ignore_older: 1h
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    max_workers: 1
    poll: true
    poll_interval: 30s
    file_selectors:
    - regex: '/app/logs/'

Here is what I found through additional research:
- On every poll_interval, the ListBlobs command is executed (number of files in the container + 1) times.
- According to the log contents, the scheduleOnce function is executed the same number of times (see the note and sketch below).
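The debug log also says "1 blobs fetched for current batch", which matches my max_workers: 1 setting. If the list page size is tied to max_workers (this is only my assumption; I have not confirmed it in the filebeat source), then raising it might reduce the number of ListBlobs pages needed per poll. A minimal sketch of what I plan to try, using the same placeholder names as my config above:

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    # Assumption, not confirmed: if each ListBlobs page returns up to
    # max_workers blobs, a larger value means fewer list requests per poll.
    max_workers: 10
    poll: true
    poll_interval: 30s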

Can someone tell me whether this behavior is as designed? I am quite stuck.

Hello and welcome,

If I'm not wrong, yes, this is how it works: filebeat polls the storage container every poll interval and lists the blobs in it to determine which files it needs to download and process.

It is mentioned in the documentation that this can get expensive if you have a large number of files.

Polling is generally recommended for most cases, even though it can get expensive when dealing with a very large number of files.

I don't think so; this is how polling works when consuming data from blob storage. It is the same as polling S3 or GCS: with large buckets it can get expensive.
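That said, even if you cannot avoid the listing itself, you can reduce its rate: the number of ListBlobs calls scales with how often you poll, so lengthening poll_interval lowers the List cost proportionally if you can tolerate higher ingestion latency. A minimal sketch based on the config you shared:

filebeat.inputs:
- type: azure-blob-storage
  id: storage
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    max_workers: 1
    poll: true
    # 300s instead of 30s means one tenth the polling cycles, and
    # therefore roughly one tenth the ListBlobs requests per minute.
    poll_interval: 300s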

One alternative would be to change your data pipeline and send your logs to an Event Hub instead of storage containers.
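For reference, here is a minimal sketch of what the consuming side could look like with filebeat's azure-eventhub input. Every name and key below is a placeholder, and the storage account in this input is only used to persist checkpoints, not polled for log files:

filebeat.inputs:
- type: azure-eventhub
  # All values below are placeholders; replace them with your own.
  eventhub: "my-eventhub"
  consumer_group: "$Default"
  connection_string: "Endpoint=sb://my-namespace.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=..."
  # Blob storage is still used here, but only for small checkpoint blobs.
  storage_account: "mystgacnt"
  storage_account_key: "dummy_key"
  storage_account_container: "filebeat-checkpoints"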