Suggestions for S3 input shortcomings, buffering & durability, & redis

Several small issues here, with some common themes. We've been using Elasticsearch for about 18 months for general website search. This week, however, I set up our first Logstash instance and began processing a massive amount of ELB logs from S3, as well as some low-volume data from our application, sent via Redis pub/sub and list queues depending on the durability needs.

The ELB(s) are handling about 1,000-2,500 events/min. In the coming days, I would like to ingest CloudFront logs for nearly 1,000 CloudFront distributions (which, aggregated, should be about the same number of events) and about a dozen HTTP-facing S3 buckets, which should be another few thousand per sec.

It's become pretty obvious that the S3 input plugin is simply not sufficient. An older thread here mentions that it will not scale across multiple instances because of S3's consistency model. I am already having issues with it, which I assume come from using so many workers. Today about 4 hours of logs simply did not get processed via the S3 input. Logs before and after were ingested, but it skipped over about 70 files and refused to process them until I restarted Logstash. I am using backup_to_bucket + delete at the moment to ensure no double processing. Finally, I need very fine-grained control over the ListObjects calls, and some kind of locking so we can scale Logstash to more than one instance, as well as cope with the non-trivial prefix scheme all the logs use (and skip some prefixes without needing to iterate over them at all).
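To illustrate what I mean by controlling the list calls: rather than listing a whole bucket, a helper could generate only the day-level prefixes worth scanning. This is just a rough Python sketch. It assumes the standard ELB access-log key layout (`AWSLogs/<account-id>/elasticloadbalancing/<region>/<yyyy>/<mm>/<dd>/`); the function name, base prefix, account ID, and region are made-up placeholders, and our real prefix scheme is more involved than this.

```python
from datetime import date, timedelta


def elb_day_prefixes(base, account_id, region, start, end):
    """Yield one S3 key prefix per day in [start, end], inclusive.

    Assumes the standard ELB log layout; `base` is whatever prefix
    the load balancer was configured to write under (may be "").
    """
    day = start
    while day <= end:
        yield (
            f"{base}AWSLogs/{account_id}/elasticloadbalancing/"
            f"{region}/{day:%Y/%m/%d}/"
        )
        day += timedelta(days=1)


# Hypothetical values, for illustration only.
prefixes = list(
    elb_day_prefixes(
        "prod-elb/", "123456789012", "us-east-1",
        date(2018, 2, 27), date(2018, 3, 1),
    )
)
for prefix in prefixes:
    print(prefix)
```

Each prefix would then be passed to its own list call, so days that are already fully processed are never iterated at all.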

So, to get to the point: my idea is to write a custom script to run alongside Logstash, using a very carefully crafted distributed locking setup around ListObjects. It would download the file, write it to local disk (if necessary), send it to Logstash and receive an OK, then finally purge it from local disk (if necessary), move it to a new prefix and/or bucket, and prune all that metadata when safe.
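For the locking part, the usual single-instance Redis pattern is `SET key value NX PX ttl` to claim, plus a small compare-and-delete Lua script to release. Below is a minimal Python sketch of that idea, not a finished design. The `lock:` key naming, the TTL, and the helper names are my own assumptions, and `FakeRedis` is only an in-memory stand-in (it ignores expiry) so the sketch runs without a server. With the redis-py client, the same `set(..., nx=True, px=...)` and `eval(script, numkeys, ...)` calls should apply.

```python
import uuid

# Compare-and-delete so a worker only releases a lock it still owns.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
"""


def try_claim(client, s3_key, owner, ttl_ms=300_000):
    """Return True if this worker now holds the lock for s3_key."""
    return bool(client.set(f"lock:{s3_key}", owner, nx=True, px=ttl_ms))


def release(client, s3_key, owner):
    """Release the lock only if it is still held by owner."""
    return client.eval(RELEASE_SCRIPT, 1, f"lock:{s3_key}", owner) == 1


class FakeRedis:
    """In-memory stand-in for illustration; mimics only set() and eval()
    as used above and does not implement key expiry."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, nx=False, px=None):
        if nx and name in self.data:
            return None
        self.data[name] = value
        return True

    def eval(self, script, numkeys, key, owner):
        if self.data.get(key) == owner:
            del self.data[key]
            return 1
        return 0


client = FakeRedis()  # a real deployment would use an actual Redis client
worker_a, worker_b = str(uuid.uuid4()), str(uuid.uuid4())
key = "AWSLogs/example.log"  # hypothetical object key

got_a = try_claim(client, key, worker_a)   # first claim wins
got_b = try_claim(client, key, worker_b)   # second worker is refused
stolen = release(client, key, worker_b)    # non-owner cannot release
released = release(client, key, worker_a)  # owner can
```

The TTL matters here: if a worker dies mid-download, the lock expires and another instance can pick the file up, which is also why the processing step itself still needs to tolerate an occasional repeat.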

You can probably see where I am going with this already. If anyone has any ideas or suggestions, I am all ears. The locking system will likely use Redis. My initial plan was to write this in PHP (don't laugh!), as I am not a Ruby developer at all. The plan was quite simply to download to a temporary location (after checks to ensure the file hadn't already been downloaded, processed, or moved out), then use an atomic rename on the local file system to move it to an "incoming" directory and use the file input plugin.
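The download-then-rename step would look roughly like this (shown in Python rather than PHP just to keep the sketch short). The directory names and the `publish_atomically` helper are hypothetical. The important assumption is that the staging and incoming directories live on the same filesystem, since a rename is only atomic in that case.

```python
import os
import tempfile


def publish_atomically(data, incoming_dir, name, staging_dir):
    """Write data into staging_dir, then rename it into incoming_dir.

    A reader watching incoming_dir sees either no file or the whole
    file, never a partial one, provided both directories are on the
    same filesystem.
    """
    os.makedirs(incoming_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before publishing
        final_path = os.path.join(incoming_dir, name)
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
        return final_path
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


# Throwaway directories, for illustration only.
root = tempfile.mkdtemp()
path = publish_atomically(
    b"sample log line\n",
    incoming_dir=os.path.join(root, "incoming"),
    name="example.log",
    staging_dir=os.path.join(root, "staging"),
)
print(path)
```

The file input would then only ever be pointed at the incoming directory, never at staging.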

So, just a few questions:
... Assuming my system is perfectly durable and atomic, once I drop the file to Logstash, what should I be on the lookout for? I am aware of the durability option for Logstash's queue; however, I have no idea how this works in relation to the file input. In other words, how does it cope when an event doesn't even make it to the queue? In all honesty, I'm not that worried about losing a few hundred events in the event of a crash or ungraceful termination, but I might as well do it right if I'm doing it. I'm just concerned that, due to sheer volume, any small issue will scale up into a big one.

My biggest upset with using the file input method is that I don't really want to deal with tailing the Logstash "log to file after" output to get the list of processed files. Is there no way to just mv the file (move_after?)

... Should I perhaps forgo the files altogether and simply read the files myself, push to a Redis list, and ingest this list?

... Or should I just do things the Logstash way and attempt to write my own Ruby plugin, or heck, a Java one? How difficult would this be? I've never written a single line of Ruby, and barely any Java, but I'm willing to give it a shot. Perhaps I could do a hybrid setup as well, with a separate system handling the S3 aspect, and simply write up a file ingest based on receiving the file paths via Redis. Or, heck, maybe even S3 paths.
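For the Redis-list option mentioned above, the pushing side could be as small as the sketch below: read lines, skip blanks, and `RPUSH` them in batches, with Logstash's redis input (`data_type => "list"`) consuming the other end. The list name, batch size, and `push_lines` helper are assumptions of mine, and `FakeRedis` is again just an in-memory stand-in so the example runs on its own.

```python
import io


def push_lines(client, list_key, lines, batch_size=500):
    """RPUSH non-empty lines to list_key in batches; return count pushed."""
    batch, total = [], 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        batch.append(line)
        if len(batch) >= batch_size:
            client.rpush(list_key, *batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        client.rpush(list_key, *batch)
        total += len(batch)
    return total


class FakeRedis:
    """In-memory stand-in for illustration; mimics only rpush()."""

    def __init__(self):
        self.lists = {}

    def rpush(self, name, *values):
        self.lists.setdefault(name, []).extend(values)
        return len(self.lists[name])


client = FakeRedis()  # a real deployment would use an actual Redis client
sample = io.StringIO("line one\n\nline two\nline three\n")
pushed = push_lines(client, "elb-logs", sample, batch_size=2)
print(pushed, client.lists["elb-logs"])
```

One thing I would still need to think through with this route: once an entry is popped off the list it is gone, so anything in flight inside Logstash during a crash depends on Logstash's own queue durability rather than on Redis.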

Anyway, as you can see, I am quite in the mud here with seemingly no clear and clean solution. I don't mean to be so broad, nor to write a manifesto. I'd just like to keep things as simple as possible and reuse existing plugins/libraries as much as possible. If I don't need to write my own X, then I'd prefer not to.

Surely someone else has run into all of this before. Any tips, suggestions, advice, or words of encouragement?

Thanks!
