I've been trying to figure out this problem for the past week and haven't found a solution other than restarting Logstash repeatedly when it runs out of resources. Looking for any help or suggestions on what we can try; right now I'm ingesting in chunks and restarting Logstash between batches (details below).
Problem in a nutshell
The s3 input plugin appears to keep opening file descriptors until it reaches the open file descriptor limit, at which point Logstash becomes totally unresponsive and is effectively hung. If the file limit is set to something small like 4K, 8K, or 16K files, that's exactly what happens. When I set it higher (e.g. 32K or 64K files), it climbs to a certain point (around 25K open file descriptors) and then completely chokes from lack of memory, with the JVM heap at roughly 95%+ of the 8GB allocated. Either way, the root problem seems to be too many open file descriptors.
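For reference, the descriptor limit is being raised on the container itself, roughly like this (a docker-compose style sketch; the exact numbers vary between tests):

ulimits:
  nofile:
    soft: 65536
    hard: 65536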
Related threads
There are a handful of topics and issues describing similar problems; none are smoking guns and most seem unresolved:
Logstash crashing with exception IOError: Too many open files · Issue #4815 · elastic/logstash · GitHub (OPEN since 2018)
File descriptors are leaked when using HTTP · Issue #1604 · elastic/logstash · GitHub (CLOSED)
elasticsearch - Logstash close file descriptors? - Stack Overflow
Configuration & Pipeline
I've mainly been running this in a Docker container with Xmx and Xms set to 8GB. I've also tried running Logstash directly on our Linux machine (no Docker) and the results are the same.
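For reference, the heap is set with the standard jvm.options entries (the container gets the same values via the LS_JAVA_OPTS environment variable):

-Xms8g
-Xmx8g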
Here are the config files; they're pretty simple overall.
Pipeline in a nutshell: we have several pipelines reading data from s3. Each s3 input points to a separate bucket with multiple subfolders (prefixes), each subfolder gets ~200 files/day, and at the moment we're trying to ingest about 90 days' worth of data, so roughly 18K files per pipeline. (A slightly fleshed-out placeholder version of one pipeline is sketched after the config snippets below.)
logstash.yml - nothing special configured, basically using defaults
pipelines.yml - example of pipeline
- pipeline.id: sample
  path.config: "sample-pipeline.conf"
  queue.type: persisted
sample-pipeline.conf
input { s3 {} }
filter { mutate, kv, grok, drop, fingerprint }
output { elasticsearch {} }
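For a bit more detail, each pipeline conf looks roughly like the sketch below (bucket, prefix, host, and index names are placeholders, not the real values):

input {
  s3 {
    bucket => "sample-bucket"   # placeholder - one bucket per pipeline
    prefix => "subfolder-a/"    # placeholder - one of the subfolder prefixes
    region => "us-east-1"       # placeholder
  }
}
filter {
  # mutate, kv, grok, drop, and fingerprint filters, as summarized above
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]   # placeholder
    index => "sample-index"                  # placeholder
  }
}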
Attempted Fixes
I've played with the open file limits, the allocated JVM heap, and other configuration options, but at the end of the day I can't figure out how to get file descriptors to close, which leads to either hitting the descriptor limit or the heap maxing out, at which point GC takes over and dominates all free resources. I've also attempted a few things on the input side, like setting the delete option so each file is deleted from s3 after it's read (see the snippet below). Still no luck, so we're stuck with the workaround of running batches of files, shutting down Logstash, and restarting.
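For reference, the delete attempt was just the s3 input's delete flag, roughly like this (bucket name is a placeholder):

input {
  s3 {
    bucket => "sample-bucket"   # placeholder
    delete => true              # remove each object from the bucket after it has been read
  }
}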