[background]
I want to configure a data pipeline in which Filebeat collects log data that is written to Azure Storage and sends it to Logstash.
The blobs are append blobs, and new log entries are appended roughly once a minute. The log output format is determined by the product and cannot be changed.
[problem]
When testing with the above configuration, Filebeat reads the append blob from the beginning every time it detects an append, resulting in duplicate logs being output to Logstash.
[question]
Is it possible to have Filebeat output only the newly appended portion of an append blob?
As a workaround, you could download the file from Azure Blob Storage at some interval and replace the previous local copy. Filebeat tracks its position in the file, so each time you replace the copy it should resume from the previous offset and read to the end. The drawback is that you download the same file many times, and as far as I know the Azure command-line tools do not support delta downloads of append blobs.
Alternatively, you may be able to set up a notification and use an Azure Function to read the append blob starting at the last offset, push the events to Event Hub, and then use the Event Hub input to send them to Elasticsearch.
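To make the "read starting at the offset" idea concrete, here is a minimal Python sketch of the offset bookkeeping such a function would need. It is not tied to any Azure trigger: `read_appended_lines` and `fetch_range` are names I made up for illustration, and storing the returned offset between runs is left to you. Assuming the `azure-storage-blob` package, the range read could be wired to `BlobClient.download_blob(offset=..., length=...)` as noted in the comments.

```python
from typing import Callable, List, Tuple


def read_appended_lines(
    blob_size: int,
    fetch_range: Callable[[int, int], bytes],
    offset: int,
) -> Tuple[List[str], int]:
    """Return the complete lines appended after `offset` and the next offset.

    `fetch_range(start, length)` is a hypothetical callable that returns the
    requested byte range of the blob. With azure-storage-blob it could be:
        lambda start, length: blob_client.download_blob(
            offset=start, length=length).readall()
    and `blob_size` could come from blob_client.get_blob_properties().size.
    """
    if blob_size < offset:
        # The blob is smaller than what we already read, so assume it was
        # replaced and start again from the beginning.
        offset = 0
    if blob_size == offset:
        return [], offset  # nothing new since the last run

    data = fetch_range(offset, blob_size - offset)

    # Only emit complete lines; a partial trailing line is left for the next
    # run so that an event is never split in two.
    end = data.rfind(b"\n")
    if end == -1:
        return [], offset
    complete = data[: end + 1]
    lines = complete.decode("utf-8").splitlines()
    return lines, offset + len(complete)
```

Each run only transfers the bytes past the stored offset, which avoids both the repeated full downloads and the duplicate events. The lines returned could then be forwarded to Event Hub or written to a local file that Filebeat tails.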
If you're able to change the application:
If the application eventually rotates to a new log file after appending for a period of time, you could focus your input just on the rotated log files and avoid reading the appended files.
Append blobs are optimized for continuous appending. One write operation per minute isn't a high-throughput append use case, so each write could just go to a new blob file instead. I don't believe Azure pricing is any different between these two scenarios.
Finally, as this is a missing feature in Beats, I would recommend opening an issue in the Integrations repository on GitHub for the azure_blob_storage integration and an issue in the Beats repository on GitHub for the azure-blob-storage input, and having them link to each other.