We are having a lot of problems with our two Logstash instances. Each host has 72 cores and a 128 GB heap. I'll upload the monitoring screenshots below so you can get better insight into our configuration and events per second.
300 hosts are sending their log files via Filebeat and Logstash to Elasticsearch.
The problem is that Logstash goes into endless GC cycles and fills the entire heap. After that there is no activity at all, and the log doesn't help because nothing is written to it, even in debug mode; it just stops working.
Here are the config files:
logstash.yml:

node.name: "our.host.com"
path.data: "/var/lib/logstash"
http.host: "our.host.com"
http.port: 9600
log.level: debug
path.logs: /var/log/logstash
xpack.monitoring.enabled: "true"
xpack.monitoring.elasticsearch.url: ["http://our.elkhost.com:9200"]
pipelines.yml:

- pipeline.id: main
  path.config: "/etc/logstash/pipelines/main/*.conf"
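For completeness, the worker/batch settings we could add to that pipeline entry would look roughly like this (a sketch with illustrative values, not our current production config):

```yaml
# pipelines.yml sketch - explicit tuning knobs (values are examples only)
- pipeline.id: main
  path.config: "/etc/logstash/pipelines/main/*.conf"
  pipeline.workers: 72        # defaults to the number of CPU cores
  pipeline.batch.size: 125    # events per worker batch; bigger batches hold more heap
  pipeline.batch.delay: 50    # ms to wait before flushing an undersized batch
```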
jvm.options:

-Xms128g
-Xmx128g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
-Djruby.compile.invokedynamic=true
-Djruby.jit.threshold=0
-XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
# log GC status to a file with time stamps
# ensure the directory exists
-Xloggc:/tmp/logstash-gc.log
# Entropy source for randomness
-Djava.security.egd=file:/dev/urandom
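In case it matters for the discussion: one thing we are considering is shrinking the heap drastically, since the Logstash documentation recommends a heap in the single-digit-gigabyte range rather than most of the host's RAM. A sketch of what that jvm.options would look like (not what we currently run):

```yaml
# jvm.options sketch with a much smaller heap (illustrative; -Xms/-Xmx kept equal
# as the Logstash docs advise, everything else unchanged from our current file)
-Xms8g
-Xmx8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=500
```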
An example from gc.log:
2018-02-26T14:50:21.068+0100: 1856.041: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 436207616 bytes, new threshold 15 (max 15)
, 0.1091606 secs]
   [Parallel Time: 99.3 ms, GC Workers: 48]
      [GC Worker Start (ms): Min: 1856042.4, Avg: 1856042.9, Max: 1856043.2, Diff: 0.8]
      [Ext Root Scanning (ms): Min: 3.9, Avg: 7.1, Max: 18.1, Diff: 14.2, Sum: 341.2]
      [Update RS (ms): Min: 79.9, Avg: 90.3, Max: 93.6, Diff: 13.6, Sum: 4336.2]
         [Processed Buffers: Min: 64, Avg: 133.6, Max: 214, Diff: 150, Sum: 6413]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 2.6]
      [Termination (ms): Min: 0.0, Avg: 0.4, Max: 0.7, Diff: 0.7, Sum: 20.8]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 48]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.8]
      [GC Worker Total (ms): Min: 97.7, Avg: 98.0, Max: 98.5, Diff: 0.8, Sum: 4704.7]
      [GC Worker End (ms): Min: 1856140.9, Avg: 1856140.9, Max: 1856140.9, Diff: 0.0]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.6 ms]
   [Other: 8.2 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 3.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 2.1 ms]
      [Humongous Register: 0.8 ms]
      [Humongous Reclaim: 0.2 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(6528.0M)->0.0B(6528.0M) Survivors: 0.0B->0.0B Heap: 125.7G(128.0G)->125.7G(128.0G)]
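To watch heap occupancy over time we pull the "Heap:" summaries out of lines like the one above. A minimal sketch of how we do that (the regex assumes the G1 log format shown here, with G/M units):

```python
import re

# Matches the heap summary at the end of a G1 pause record, e.g.
# "Heap: 125.7G(128.0G)->125.7G(128.0G)" - group 4/5 are occupancy after the pause.
HEAP_RE = re.compile(r"Heap: ([\d.]+)([GM])\(([\d.]+)[GM]\)->([\d.]+)([GM])")

def heap_after_gc(line):
    """Return heap occupancy in GiB after the pause, or None for other lines."""
    m = HEAP_RE.search(line)
    if not m:
        return None
    value, unit = float(m.group(4)), m.group(5)
    return value if unit == "G" else value / 1024.0  # normalize MiB to GiB

sample = ("[Eden: 0.0B(6528.0M)->0.0B(6528.0M) Survivors: 0.0B->0.0B "
          "Heap: 125.7G(128.0G)->125.7G(128.0G)]")
print(heap_after_gc(sample))  # -> 125.7
```

Run over the whole gc.log this makes it obvious that occupancy never drops after a pause, which is what points us at a leak rather than normal churn.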
And here are the screenshots.
If you need the configuration for our filters, I'll gladly post it, just let me know; but we've tried with a simple filter and the result was the same.
It seems that there's some kind of memory leak.
I would be very grateful if someone has any idea that could help us. We tried switching the queue to persisted and changed the values for checkpoints, acks, queue max bytes, etc., but nothing helped.
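Concretely, the persisted-queue settings we experimented with in logstash.yml were along these lines (the exact values varied between runs):

```yaml
# logstash.yml - persisted queue settings we varied (values shown are examples)
queue.type: persisted
queue.max_bytes: 4gb              # cap on disk usage for the queue
queue.checkpoint.writes: 1024     # force a checkpoint after this many written events
queue.checkpoint.acks: 1024       # force a checkpoint after this many ACKed events
```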
Thanks in advance,