I need to implement high availability of Logstash reading log files from S3.
Is there any way to implement HA by scaling out without duplicating events?
Each VM is going to keep its own record of which file it has read up to, so I will end up with duplicated logs...
If I share the file (via NFS) that stores the timestamp of the last processed file, both instances will compete for the same file and will probably end up reading the same files again.
You may be able to achieve that if you switch to Filebeat to collect the logs, but you will also need to enable S3 event notifications to an SQS queue and use Filebeat's aws-s3 input to consume the notification messages from SQS.
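A minimal sketch of what that Filebeat side could look like, assuming you have already created an SQS queue that receives the bucket's s3:ObjectCreated notifications. The queue URL and Logstash hosts below are placeholders, not values from your setup:

```yaml
# filebeat.yml (sketch) -- multiple Filebeat instances can read the same
# SQS queue safely: each S3 notification is delivered to one consumer at a
# time, so scaling out does not duplicate events.
filebeat.inputs:
  - type: aws-s3
    # Placeholder queue URL; point it at the queue that receives the
    # bucket's s3:ObjectCreated:* notifications.
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/s3-log-notifications
    # How long a message stays hidden from other consumers while this
    # instance processes the referenced object.
    visibility_timeout: 300s

# Placeholder Logstash endpoints; load-balance across your instances.
output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044"]
  loadbalance: true
```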
Another alternative is to decouple downloading the logs from processing them, but this requires adding other tools to your stack.
I have a scenario with multiple Logstash instances processing data from S3, but for this to work I use the following structure (see the sketch below).
Custom Python script to download the files -> Vector (vector.dev) to read the files and put the lines on Kafka topics -> multiple Logstash instances consuming from Kafka.
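For the last stage, the deduplication comes from Kafka's consumer groups: every Logstash instance runs the same pipeline with the same group_id, so Kafka assigns each topic partition to exactly one instance. A rough sketch of that pipeline, with placeholder broker, topic, group id, and output:

```
# logstash pipeline (sketch) -- run the same pipeline on every Logstash VM.
# Instances sharing one group_id split the topic partitions between them,
# so each event is processed by only one instance.
input {
  kafka {
    bootstrap_servers => "kafka-1:9092"           # placeholder broker
    topics            => ["s3-log-lines"]          # placeholder topic
    group_id          => "logstash-s3-consumers"   # shared consumer group
    codec             => "json"
  }
}

output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]        # placeholder destination
  }
}
```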