I'm working on an ELK-stack installation, and there's a Logstash instance which, for some as-yet-unknown reason, periodically (once every 2-10 days) stops inserting data into ES.
The most confusing part is that Logstash gives no indication that anything is wrong: the process does not crash, and the logs show no errors. It even keeps writing some logs as if everything were fine (the jdbc_static plugin adds log entries every hour).
The only way I can tell that Logstash has stopped inserting data into ES is by looking at Kibana and seeing no new data.
My current thoughts about this issue are:
I have a lot of devices sending data to Logstash (via the lumberjack input plugin). Maybe once in a while my pipeline gets overloaded by the input volume (or by the heavy processing before the output). Logstash is a multi-threaded app, so maybe some of the threads responsible for input or output get blocked while the app itself keeps running.
Eventually the only way to "unblock" those threads (and revive Logstash) is to restart it, and then wait another 2-10 days until it stops inserting data again.
But this is clearly a bad way of dealing with the issue.
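One way to test the blocked-thread hypothesis might be a JVM thread dump. A minimal sketch, assuming `jstack` from the JDK is on the PATH and you can attach to the Logstash process (in 6.x the pipeline worker threads are named like `[main]>worker0`):

```shell
#!/bin/sh
# Sketch: dump the threads of the running Logstash JVM and list the
# pipeline worker threads. Worker threads stuck in BLOCKED or parked
# in an output call would support the overload theory.

dump_workers() {
  ls_pid=$(pgrep -f org.logstash.Logstash | head -n1)
  jstack "$ls_pid" > /tmp/logstash-threads.txt
  # Show each worker thread with a few lines of its stack trace
  grep -A 5 '>worker' /tmp/logstash-threads.txt
}

# dump_workers   # run once while ingestion is stalled, once when healthy
```

Comparing a dump taken during a stall with one taken while healthy should show where the workers are stuck.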
Has anyone run into this kind of issue before?
How do I make Logstash unblock or restart its own threads?
Or how do I make Logstash restart itself when its threads are blocked?
Maybe there's a known way to approach this problem.
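In the meantime, the restart workaround could at least be automated by polling the Logstash monitoring API. A sketch, assuming the HTTP API is on the default localhost:9600 and the service is managed via `service logstash` (adjust both for your install):

```shell
#!/bin/sh
# Watchdog sketch: if the node-level events.out counter does not
# advance between two polls, assume the pipeline is stalled and
# restart Logstash.

# Read events.out from the monitoring API (crude JSON parse).
get_out() {
  curl -s localhost:9600/_node/stats/events |
    sed 's/.*"out":\([0-9]*\).*/\1/'
}

# True (exit 0) when the counter has not advanced between two polls.
stalled() {
  [ "$1" -eq "$2" ]
}

# Compare two polls five minutes apart and restart on a stall.
watchdog() {
  before=$(get_out)
  sleep 300
  after=$(get_out)
  if stalled "$before" "$after"; then
    sudo service logstash restart
  fi
}

# watchdog   # e.g. invoke from cron every 10 minutes
```

Note this only automates the workaround; it doesn't address the root cause, and it assumes zero legitimate quiet periods longer than the polling window.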
Info about my installation:
- ELK stack version 6.6.0.
- openjdk version "1.8.0_201"
- OS Amazon Linux AMI 2018.03
- pipeline.workers: 8
- LS_JAVA_OPTS=" -Xmx2g -Xms2g"
- Logstash is accepting data with the lumberjack input plugin (with congestion_threshold => 15).
- I have a lot of stuff happening inside the filter section of the config. There's grok, mutate, date, ruby, jdbc_streaming and jdbc_static plugins.
- I also have many devices sending data; each device can send up to 1000 events per batch, every 30 seconds.
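For reference, here is roughly how those settings are wired up in my install (certificate paths below are placeholders, not my real ones):

```
# logstash.yml
pipeline.workers: 8

# pipeline config, input section
input {
  lumberjack {
    port => 5000                       # placeholder port
    congestion_threshold => 15
    ssl_certificate => "/etc/logstash/certs/lumberjack.crt"   # placeholder
    ssl_key => "/etc/logstash/certs/lumberjack.key"           # placeholder
  }
}
```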
Thank you for any help!