I am using the S3 input plugin on a project I am working on. Each day a couple hundred thousand files are sent to different buckets, then processed by Logstash. I am using the options to back up the data to the same bucket and delete the original file after it is processed, which helps, but processing is still very slow.

I am currently using no prefix, but I am using an exclude pattern. I noticed in the plugin code that this just means every key is pulled and the ones matching the exclude pattern are ignored. We have somewhere north of 10M files, and that number grows each day. Looping through all of them on every check for new files seems very inefficient.

To get around this, my current plan is to add a separate input section in my Logstash config for each prefix I could have. I would end up with about 18 different inputs in each of 5 different configs. It would likely be a one-time change, but it doesn't seem like a great setup.

My question is whether that is the best way to go about it, or if there is another possibility. It doesn't seem possible to use wildcards or an array in the prefix field, so I assume I would need a separate input for each prefix. Another option would be to back up to a different bucket, but I'm not sure that is possible due to current constraints.
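For reference, a minimal sketch of the kind of input I'm describing (the bucket name and patterns are placeholders, not our real config):

```
input {
  s3 {
    bucket            => "my-bucket"     # placeholder name
    exclude_pattern   => "^processed/"   # placeholder; applied client-side after listing
    backup_to_bucket  => "my-bucket"     # backing up into the same bucket
    backup_add_prefix => "processed/"    # placeholder prefix for processed files
    delete            => true            # delete the original after processing
  }
}
```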
Your scenario as described probably warrants multiple LS instances, with multiple pipelines, each having one or more prefixed s3 inputs. I'd speculate that you would get better throughput (both network and in LS), better isolation, and better monitoring than with a single s3 input patched to accept an array of prefixes.
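A sketch of what I mean by prefixed s3 inputs, with hypothetical bucket and prefix names:

```
input {
  s3 {
    bucket => "my-bucket"   # hypothetical
    prefix => "app1/"       # one input per prefix; S3 only lists keys under it
  }
  s3 {
    bucket => "my-bucket"
    prefix => "app2/"
  }
}
```

You can then split these across separate pipelines (or instances) to isolate throughput and monitoring per prefix.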
Are you using Centralised Pipeline Management to alter the prefix (decommission an old one and repurpose a config)?
We are not using Centralised Pipeline Management, as we don't have X-Pack and are using the Elasticsearch and Kibana that are built into AWS. I'm not sure whether X-Pack can be enabled there.
Our current setup is a separate Logstash instance and pipeline with a single input each, pointing at a specific bucket. So 5 buckets, and 5 instances of Logstash with a single input in each instance. We have no prefixes and one exclude pattern in each. Each of the buckets also has a subdirectory structure that would make prefixing easy, although mildly messy due to how many inputs/prefixes would be needed to cover everything.
I will likely be able to get new buckets set up for the processed data, which should help with the speed considerably. If that is not enough, I will most likely also add separate pipelines with the appropriate prefixes for any buckets that aren't performing as expected. The messiest solution would be one pipeline per subdirectory per bucket, which would yield 108 pipelines across 5 instances, but I highly doubt that will be necessary.
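If the new buckets come through, the backup options would just point at them instead of the source bucket, something like this (bucket names are hypothetical):

```
input {
  s3 {
    bucket           => "source-bucket"     # hypothetical
    backup_to_bucket => "processed-bucket"  # separate bucket keeps the source listing small
    delete           => true                # remove from the source once backed up
  }
}
```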
I think moving the "processed" data out of the buckets should yield the results I want, though. When we were first starting, with a smaller subdirectory structure and fewer files, the lag time for a file to be processed and the data to show up in Kibana was minimal (usually 1-2 minutes from file creation in the bucket). It has just grown worse over time, to the point where something needed to be fixed.
I didn't realize until today that the S3 plugin loops through every key and simply skips the files that match the exclude pattern. I probably should have checked that sooner, given that the AWS API only supports prefixes for limiting the keys returned, but it wasn't an issue when all of this was started. Cutting that loop from a few million keys (and growing) per bucket down to likely well under 100 at a time should get our lag time down to something manageable, if not nonexistent. The number of events we are dealing with per bucket is only going to be around a constant 540 per minute for our worst bucket, and each file is well under 1 KB. A Logstash pipeline should handle that fairly easily, provided we aren't also forcing it to loop through a couple million file names that it won't process, as it currently is.
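In other words, a prefix makes the S3 listing itself do the narrowing, while the exclude pattern only filters afterwards. Something like this (prefix and pattern are illustrative):

```
input {
  s3 {
    bucket          => "my-bucket"       # hypothetical
    prefix          => "incoming/app1/"  # server-side: only these keys come back from the S3 API
    exclude_pattern => "\.tmp$"          # client-side: applied after the listing is returned
  }
}
```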
I just wanted to check here first to see if there was something obvious I was missing, or a smarter way to handle the prefixing/excluding of keys. It looks like moving the processed data out of the buckets, and adding more pipelines to certain LS instances when needed, is the route I'll have to take. Thanks for your answer.