[background]
I am building a data pipeline to visualize application logs using the Elastic Stack.
The application log files are stored as JSON blobs in Azure Blob Storage.
The architecture overview is as follows.
Application → Azure Blob Storage → filebeat → logstash → elasticsearch → kibana
The pipeline itself works fine, and the application logs are visible in Kibana.
[problem]
The Azure Storage access log shows that filebeat is making over 10,000 ListBlobs requests per minute.
This is driving up the cost of List operations.
The polled container holds approximately 5,000 files.
The polling interval is 30s, so there are two polls per minute, and 5,000 files × 2 polls matches the roughly 10,000 requests per minute. I'm therefore guessing that filebeat issues one ListBlobs request per file at each polling interval.
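As a rough check of this guess, the arithmetic can be written out as below. This is only a sketch of my hypothesis (one ListBlobs call per blob per poll), not confirmed filebeat behavior; the variable names are just for illustration.

```python
# Back-of-the-envelope check of the hypothesis above.
blob_count = 5000        # approximate number of files in the polled container
poll_interval_s = 30     # poll_interval from the filebeat settings

polls_per_minute = 60 // poll_interval_s

# Hypothesis (unconfirmed): one ListBlobs call per blob at every poll.
list_calls_per_minute = blob_count * polls_per_minute

print(polls_per_minute)       # 2
print(list_calls_per_minute)  # 10000
```

The result lines up with the request rate seen in the access log, which is why I suspect a per-blob List call.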
[question]
- Is my assumption above (filebeat executes ListBlobs once per blob file at each poll) correct, and is this the intended behavior of filebeat?
- Is there anything I can do to avoid executing a large number of ListBlobs commands?
[Additional information: my environment]
OS: AlmaLinux 8.8 (Azure VM)
filebeat version: 8.13.4
[Already tried]
None of the following had any effect.
- Deleting the checkpoint file and restarting filebeat
- Reinstalling filebeat
[debug log]
The following three messages are logged repeatedly.
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":141},"message":"scheduler: 0 jobs scheduled for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.795+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":154},"message":"scheduler: total objects read till now: 3895\nscheduler: total jobs scheduled till now: 0","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
May 20 12:09:12 my-vm filebeat[1894132]: {"log.level":"debug","@timestamp":"2024-05-20T12:09:12.797+0900","log.logger":"input.azure-blob-storage","log.origin":{"function":"github.com/elastic/beats/v7/x-pack/filebeat/input/azureblobstorage.(*scheduler).scheduleOnce","file.name":"azureblobstorage/scheduler.go","file.line":107},"message":"scheduler: 1 blobs fetched for current batch","service.name":"filebeat","id":"storage","input_source":"my-stg_acnt::my-ctr","account_name":"my-stg_acnt","container_name":"my-ctr","ecs.version":"1.6.0"}
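To make the repeated messages easier to compare, the JSON payload can be separated from the journal prefix. The following is a minimal sketch: `parse_journal_line` is a hypothetical helper of my own, and the sample is a shortened copy of the third line above (fields trimmed for readability).

```python
import json


def parse_journal_line(line: str) -> dict:
    """Drop the syslog-style prefix and decode the JSON payload."""
    start = line.index("{")  # the JSON payload begins at the first brace
    return json.loads(line[start:])


# Shortened copy of the third debug line above (some fields removed).
sample = (
    'May 20 12:09:12 my-vm filebeat[1894132]: '
    '{"log.level":"debug",'
    '"log.origin":{"file.name":"azureblobstorage/scheduler.go","file.line":107},'
    '"message":"scheduler: 1 blobs fetched for current batch"}'
)

record = parse_journal_line(sample)
print(record["log.origin"]["file.line"], record["message"])
```

Grouping the decoded records by `file.line` and `message` makes it easy to see how often each of the three scheduler messages occurs within one polling interval.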
[filebeat settings (partial excerpt)]
filebeat.inputs:
- type: azure-blob-storage
  id: storage
  enabled: true
  tags: ["my_dev_log"]
  ignore_older: 1h
  account_name: mystgacnt
  auth.shared_credentials.account_key: dummy_key
  containers:
  - name: my_cnt
    max_workers: 1
    poll: true
    poll_interval: 30s
    file_selectors:
    - regex: '/app/logs/'