Multiple Logstash Docker containers sharing an S3 input

This is one of those "I can't be the first one" kinda questions.

We're running Logstash in Docker, anywhere between 3 and 6 instances depending on logging volume. Our current input is just a Redis box (ElastiCache, in reality), and we output to an Elasticsearch cluster.
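For context, each container runs essentially this pipeline (the host names and Redis key here are placeholders, not our real values):

```
input {
  redis {
    host      => "our-elasticache-endpoint"   # placeholder for the ElastiCache endpoint
    data_type => "list"
    key       => "logstash"                   # placeholder list key
  }
}
output {
  elasticsearch {
    hosts => ["es-cluster:9200"]              # placeholder for the ES cluster
  }
}
```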

It all works great, and using Marathon we can scale up and down very easily.

We're now investigating also consuming some AWS-generated logging, in particular ELB logs and CloudTrail logs. These are generated by AWS and placed into an S3 bucket for us.

We'd like to have that S3 bucket as an input for Logstash. But, looking at the S3 input plugin, it seems each Logstash instance would create its own sincedb, and then ingest from the last known point recorded in there.
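For what it's worth, the config we've been experimenting with looks roughly like this (bucket name and prefix are made up, and I'm not sure we have the sincedb handling right):

```
input {
  s3 {
    bucket       => "our-elb-logs"                                    # made-up bucket name
    prefix       => "AWSLogs/123456789012/elasticloadbalancing/"     # made-up prefix
    region       => "us-east-1"
    sincedb_path => "/var/lib/logstash/elb.sincedb"  # lives inside the container, per instance
    interval     => 60
  }
}
```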

This has two problems for us:

  • Firstly, the sincedb file entails state, and our Logstash containers are stateless.
  • Secondly, the various instances of Logstash aren't visible to each other, so the same data will be read and pushed to ES multiple times.

I'm curious why I can't find any previous examples of people trying this, because it feels like it should be a pretty common use case nowadays.

Has anyone else attempted this before? Are we on the totally wrong track here? Any insights would be much appreciated!

Logstash currently has no features for sharing state or otherwise synchronizing actions between instances, including sincedb state for the file or s3 inputs. So, no matter what you do, you can't have two instances pulling from the same files in S3 without duplicating data.

If your containers really need to be stateless I suppose you could run another process inside the container that periodically pushes the sincedb file to S3 or some other Amazon data store. Then, modify the container's startup script to pull the same resource before starting Logstash so that you can kill and restart the container and not lose the persisted state.
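Very roughly, that sidecar could look something like the sketch below, assuming the container's startup script runs the restore step before launching Logstash. The bucket, key, and paths are all made up, and you'd adapt them to your setup:

```python
import os
import time

import boto3
from botocore.exceptions import ClientError

# Hypothetical locations -- adjust to your own bucket and container layout.
STATE_BUCKET = "my-logstash-state"               # assumed bucket for persisted state
STATE_KEY = "sincedb/elb-logs.sincedb"           # assumed object key
SINCEDB_PATH = "/var/lib/logstash/elb.sincedb"   # must match sincedb_path in the s3 input
PUSH_INTERVAL = 60                               # seconds between uploads

s3 = boto3.client("s3")

def restore():
    """Pull the last persisted sincedb before Logstash starts, if one exists."""
    try:
        s3.download_file(STATE_BUCKET, STATE_KEY, SINCEDB_PATH)
    except ClientError:
        # No previous state; Logstash will start from scratch.
        pass

def push_forever():
    """Periodically upload the sincedb so a restarted container can resume."""
    while True:
        time.sleep(PUSH_INTERVAL)
        if os.path.exists(SINCEDB_PATH):
            s3.upload_file(SINCEDB_PATH, STATE_BUCKET, STATE_KEY)

if __name__ == "__main__":
    restore()
    push_forever()
```

Note this only solves the statelessness problem for a single reader; it doesn't make it safe for several Logstash instances to consume the same bucket at once.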