I've been trying to figure out this problem for the past week and haven't found a solution other than restarting Logstash repeatedly when it runs out of resources. Looking for any help or suggestions on what we can try; right now I'm ingesting in chunks and restarting Logstash between batches (details below).
Problem in a nutshell
The s3 input plugin appears to keep opening file descriptors until it reaches the open file descriptor limit, at which point Logstash becomes totally unresponsive and is effectively hung. If the file limit is set to something small like 4K, 8K, or 16K files, that's exactly what happens. When I set it higher (e.g. 32K or 64K files), it climbs to a certain point (around 25K open file descriptors) and then completely chokes from lack of memory, with the JVM heap at roughly 95%+ of the 8GB allocated. Either way, the root problem seems to be too many open file descriptors.
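For reference, the descriptor limit is being raised on the container itself, roughly like this (a docker-compose style sketch; the exact numbers vary between tests):

ulimits:
  nofile:
    soft: 65536
    hard: 65536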
Related threads
There are a handful of topics and issues describing similar problems; none are smoking guns and most seem unresolved:
Logstash crashing with exception IOError: Too many open files · Issue #4815 · elastic/logstash · GitHub (OPEN since 2018)
File descriptors are leaked when using HTTP · Issue #1604 · elastic/logstash · GitHub (CLOSED)
elasticsearch - Logstash close file descriptors? - Stack Overflow
Configuration & Pipeline
I've mainly been running this in a Docker container with Xmx and Xms set to 8GB. I've also tried running Logstash directly on our Linux machine (no Docker) and the results are the same.
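For reference, the heap is set with the standard jvm.options entries (the container gets the same values via the LS_JAVA_OPTS environment variable):

-Xms8g
-Xmx8g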
Here are the config files; they're pretty simple overall.
Pipeline in a nutshell: we have several pipelines reading data from s3. Each s3 input points to a separate bucket with multiple subfolders (prefixes), each subfolder gets ~200 files/day, and at the moment we're trying to ingest about 90 days' worth of data, so roughly 18K files per pipeline. (A slightly fleshed-out placeholder version of one pipeline is sketched after the config snippets below.)
logstash.yml - nothing special configured, basically using defaults
pipelines.yml - example of pipeline
- pipeline.id: sample
  path.config: "sample-pipeline.conf"
  queue.type: persisted
sample-pipeline.conf
input { s3 {} }
filter { mutate, kv, grok, drop, fingerprint }
output { elasticsearch {} }
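For a bit more detail, each pipeline conf looks roughly like the sketch below (bucket, prefix, host, and index names are placeholders, not the real values):

input {
  s3 {
    bucket => "sample-bucket"   # placeholder - one bucket per pipeline
    prefix => "subfolder-a/"    # placeholder - one of the subfolder prefixes
    region => "us-east-1"       # placeholder
  }
}
filter {
  # mutate, kv, grok, drop, and fingerprint filters, as summarized above
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]   # placeholder
    index => "sample-index"                  # placeholder
  }
}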
Attempted Fixes
I've played with the open file limits, the allocated JVM heap, and other configuration options, but at the end of the day I can't figure out how to get file descriptors to close, which leads to either hitting the descriptor limit or the heap maxing out, at which point GC takes over and dominates all free resources. I've also attempted a few things on the input side, like setting the delete option so each file is deleted from s3 after it's read (see the snippet below). Still no luck, so we're stuck with the workaround of running batches of files, shutting down Logstash, and restarting.
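For reference, the delete attempt was just the s3 input's delete flag, roughly like this (bucket name is a placeholder):

input {
  s3 {
    bucket => "sample-bucket"   # placeholder
    delete => true              # remove each object from the bucket after it has been read
  }
}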