I am using Filebeat 5.5.1 and Logstash 5.5.1 on Debian 8 to collect logs and send them to InfluxDB.
Filebeat and Logstash are installed together on each of 20 servers, all forwarding logs to a single InfluxDB server.
The plugins in use are logstash-input-beats, logstash-output-influxdb, and some filter plugins I wrote myself.
Everything works fine for a long time, but after a month or so things go wrong.
Logstash writes no warning or error logs. The Filebeat log looks like this:
2017-10-03T16:31:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
2017-10-03T16:31:47+08:00 INFO No non-zero metrics in the last 30s
2017-10-03T16:32:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=774
2017-10-03T16:32:19+08:00 ERR Failed to publish events caused by: read tcp 127.0.0.1:28184->127.0.0.1:5044: i/o timeout
2017-10-03T16:32:19+08:00 INFO Error publishing events (retrying): read tcp 127.0.0.1:28184->127.0.0.1:5044: i/o timeout
2017-10-03T16:32:47+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
2017-10-03T16:33:17+08:00 INFO No non-zero metrics in the last 30s
2017-10-03T16:33:47+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=769
2017-10-03T16:33:49+08:00 ERR Failed to publish events caused by: read tcp 127.0.0.1:28240->127.0.0.1:5044: i/o timeout
2017-10-03T16:33:49+08:00 INFO Error publishing events (retrying): read tcp 127.0.0.1:28240->127.0.0.1:5044: i/o timeout
2017-10-03T16:34:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
I set the JVM heap size to 4 GB for Logstash. The gcutil and top output for the Logstash process looks like this:
S0 S1 E O M CCS YGC YGCT FGC FGCT GCT
0.00 2.72 33.98 29.74 91.43 85.72 249381 5888.720 82 4.375 5893.095
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
48725 root 20 0 22.159g 4.265g 8116 S 2.7 3.4 20702:43 java
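To put the gcutil numbers in perspective, here is a rough Python sketch that only does arithmetic on the figures pasted above. It is not part of my setup, and the variable names are just illustrative. It assumes the columns have the usual gcutil meaning (O = old generation occupancy in percent, YGC/FGC = young/full collection counts, YGCT/FGCT = their cumulative times in seconds).

```python
# Header and values copied verbatim from the gcutil output above.
header = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT".split()
values = "0.00 2.72 33.98 29.74 91.43 85.72 249381 5888.720 82 4.375 5893.095".split()

# Map each column name to its numeric value.
gc = dict(zip(header, map(float, values)))

# Average pause per collection, in milliseconds.
avg_young_ms = gc["YGCT"] / gc["YGC"] * 1000
avg_full_ms = gc["FGCT"] / gc["FGC"] * 1000

print(f"old generation occupancy: {gc['O']:.2f}%")
print(f"average young GC time:    {avg_young_ms:.1f} ms")
print(f"average full GC time:     {avg_full_ms:.1f} ms")
```

If I read this correctly, the old generation is only about 30% full and the individual collections are short, so the heap does not look exhausted at the moment the snapshot was taken.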
Restarting Filebeat doesn't solve the problem, and SIGTERM can't stop Logstash, so I have to use SIGKILL and then restart Logstash. After that, things work for several weeks but eventually go wrong again.
Does anyone have any idea what is wrong? Is the heap size too small, or is there a memory leak somewhere?