TLDR; - Can I write a lambda which acts as a Beat in order to send the content of a S3-hosted file to a logstash cluster through a load balanced connection in order to distribute the workload, since the S3 input plugin in logstash cannot distribute that workload amongst multiple instances?
I'm in the process of bringing up a full ELK stack to support a reasonably sizable application infrastructure. I've got various backend services, some implemented as custom apps in node.js and some running as hosted services. As a simple intro to the ELK stack, I grabbed the most consistent structured logs I could find in our app - the logs from Amazon's Elastic Load Balancers, which front each of the custom web services. My eventual intent is to load both a structured access log from the those apps as well as a couple of days of the mostly unstructured debug/trace logs, and to even do some alerting on the differences between response time as reported from the external source (the load balancer) vs the internal response time of the backend service itself - in order to detect when latency is happening between the incoming TCP packet and the packet making it all the way up to the application layer for processing or the outbound packet sitting TCP buffers after being sent.
Elastic Load Balancer logs can only be written to an S3 bucket, though different balancers can use a different prefix for the key or a different bucket entirely.
But logstash has no way to distribute content generated by the S3 input plugin across multiple logstash instances - if I setup the input plugin on multiple instances, they will potentially all process the same file. I can put some kind of id in the record which identifies the file and then drop events which already exist in ES, but that will still cause each logstash instance to do some work prior to dropping events, resulting in poor scalability and lots of wasted effort and expense. It occurs to me, however, it would be relatively simple to install a Lambda on the S3 bucket which will fire whenever a new file is delivered, which could then just send each line as a separate message to multiple logstash instances fronted by a load balancer. Since amazon's lambda infrastructure will launch a separate lambda for each object stored in a bucket, and it will retry any that fail, including any that timeout due to backpressure, this seems like an ideal way to get a distributed architecture for processing S3-based content. The lambda acts as a host with a Beat installed in it, and the real work of processing and parsing will still happen in logstash.
What is giving me pause is that I don't see other people talking about this same problem. I can hardly be the first person to need to process logs from S3 via logstash who wants to have a properly distributed and redundant system for doing so, so surely someone out there has already solved this problem? If so, what mechanism did you use? I thought about a separate logstash instance which has a Beat output plugin that can feed into the load balanced logstash cluster, but there doesn't appear to be a Beat output plugin for logstash. Plus, then my S3 logstash instance is a single point of failure in front of the logstash cluster, while S3 lambdas are instantiated on a per-object basis, which is much more redundant. If it fails, it won't be hard to launch another instance to pick up the work which will just accumulate in S3, but if the quatity of logs gets too high for a single instance, I'll have to use cute tricks with prefixes to distribute workload amongst multiple instances, etc. It just feels kludgy, while the Lambda solution seems pretty much ideal. Each file briefly gets a Lambda to act as host for the Beat to distribute the file's contents.
Anyone have code for doing this already?