We are sending VPC flow log events from multiple AWS accounts into a central S3 bucket, and due to the large number of events we are always 5-6 days behind in Elasticsearch. I have already set batch.size to 6000 and batch.delay to 1 without any increase in the rate at which events are read. Do you have any suggestions on how to increase the number of events read from the S3 bucket?
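For reference, this is roughly what I changed (a minimal logstash.yml sketch; I'm assuming the standard pipeline.batch.* setting names here):

```yaml
# logstash.yml -- pipeline batch settings (sketch of the values described above)
pipeline.batch.size: 6000   # events each worker collects before running filters/outputs (default 125)
pipeline.batch.delay: 1     # ms to wait for new events before flushing an undersized batch (default 50)
```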
Do the VPC flow logs create a large number of files, like the CloudTrail logs do?
If that is the case, there is not much you can do; the listing process is very slow when you have a large number of files in the bucket. There is an open issue about it, but no updates.
I had a similar issue with CloudTrail logs. I was able to at least make the s3 input usable by setting the prefix in the input, but since the prefix cannot be changed dynamically, I'm using an external tool to edit the Logstash configuration file daily.
The issue is that the s3 input will list everything in the bucket, and if you have millions of files in the bucket this can take a long time.
If the VPC flow logs have a key structure similar to the CloudTrail logs, you could use the prefix to reduce the number of objects to list, but again, you would need some external tool or script to change the prefix over time.
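As a rough illustration (not tested against your setup), the input could look something like this. The prefix assumes the usual AWSLogs/&lt;account-id&gt;/vpcflowlogs/&lt;region&gt;/&lt;year&gt;/&lt;month&gt;/&lt;day&gt;/ layout AWS uses when delivering flow logs to S3, and the bucket name, account id, and date are placeholders, so adjust them to whatever your keys actually look like:

```
# Sketch: s3 input restricted to one day's worth of VPC flow log objects.
# The prefix is static, so an external script has to rewrite this file
# (and reload or restart Logstash) once a day to point at the current date.
input {
  s3 {
    bucket       => "central-flowlogs-bucket"
    region       => "us-east-1"
    prefix       => "AWSLogs/111111111111/vpcflowlogs/us-east-1/2018/05/04/"
    sincedb_path => "/var/lib/logstash/sincedb_vpcflowlogs"
  }
}
```

This does not make the listing itself faster, it just shrinks the set of objects the input has to list on each pass.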
Thank you so much for your reply! There is a large number of files for the VPC flow logs, and that is what causes the issue. I wish there were a setting in the S3 input to read files either randomly or by latest modified date; that way I could run multiple Logstash instances and increase the ingestion rate. Thank you for the suggestion, though, but changing the prefix won't work for us.