I am using Filebeat 5.5.1 and Logstash 5.5.1 on Debian 8 to collect logs and send them to InfluxDB.
Logstash and Filebeat are installed together on each of 20 servers, all forwarding logs to a single InfluxDB server.
I use logstash-input-beats, logstash-output-influxdb, and some filter plugins that I wrote myself.
Everything works fine for a long time, but after a month or so things go wrong.
Logstash logs no warnings or errors. The Filebeat log looks like this:
2017-10-03T16:31:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
2017-10-03T16:31:47+08:00 INFO No non-zero metrics in the last 30s
2017-10-03T16:32:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=774
2017-10-03T16:32:19+08:00 ERR Failed to publish events caused by: read tcp 127.0.0.1:28184->127.0.0.1:5044: i/o timeout
2017-10-03T16:32:19+08:00 INFO Error publishing events (retrying): read tcp 127.0.0.1:28184->127.0.0.1:5044: i/o timeout
2017-10-03T16:32:47+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
2017-10-03T16:33:17+08:00 INFO No non-zero metrics in the last 30s
2017-10-03T16:33:47+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=769
2017-10-03T16:33:49+08:00 ERR Failed to publish events caused by: read tcp 127.0.0.1:28240->127.0.0.1:5044: i/o timeout
2017-10-03T16:33:49+08:00 INFO Error publishing events (retrying): read tcp 127.0.0.1:28240->127.0.0.1:5044: i/o timeout
2017-10-03T16:34:17+08:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=4096
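To show the pattern in the log more clearly, here is a minimal Python sketch that measures the gap between consecutive i/o timeout errors. It only uses the two ERR lines copied from the log above; the helper name `timeout_times` is my own, and it assumes Python 3.7+ so that `datetime.fromisoformat` accepts the `+08:00` offset.

```python
from datetime import datetime

# Two ERR lines copied verbatim from the Filebeat log above.
log_lines = [
    "2017-10-03T16:32:19+08:00 ERR Failed to publish events caused by: "
    "read tcp 127.0.0.1:28184->127.0.0.1:5044: i/o timeout",
    "2017-10-03T16:33:49+08:00 ERR Failed to publish events caused by: "
    "read tcp 127.0.0.1:28240->127.0.0.1:5044: i/o timeout",
]


def timeout_times(lines):
    """Return the timestamps of ERR lines that end in an i/o timeout."""
    times = []
    for line in lines:
        # Each line starts with "<timestamp> <level> <message>".
        stamp, level, _ = line.split(" ", 2)
        if level == "ERR" and line.endswith("i/o timeout"):
            times.append(datetime.fromisoformat(stamp))
    return times


times = timeout_times(log_lines)
# Seconds between each timeout and the next one.
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)  # [90.0]
```

So in this excerpt Filebeat opens a new connection, sends a batch, and hits the read timeout again every 90 seconds, each time from a different local port.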
I set the JVM heap size to 4 GB for Logstash. The jstat -gcutil and top output for the Logstash process looks like this:
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   2.72  33.98  29.74  91.43  85.72 249381 5888.720    82    4.375 5893.095
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 48725 root      20   0 22.159g 4.265g   8116 S   2.7  3.4  20702:43 java
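In case it helps with reading those numbers, here is a rough Python sketch that breaks down the gcutil line. It assumes the columns have their usual jstat -gcutil meaning (O is old-generation occupancy in percent, YGC/FGC are event counts, YGCT/FGCT are cumulative seconds); the variable names are my own.

```python
# Header and values copied from the jstat -gcutil output above.
header = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT".split()
values = "0.00 2.72 33.98 29.74 91.43 85.72 249381 5888.720 82 4.375 5893.095".split()
gc = dict(zip(header, map(float, values)))

# Average pause per collection, converted from seconds to milliseconds.
avg_young_ms = gc["YGCT"] / gc["YGC"] * 1000
avg_full_ms = gc["FGCT"] / gc["FGC"] * 1000

print(f"old gen used: {gc['O']:.2f}%")               # old gen used: 29.74%
print(f"avg young GC pause: {avg_young_ms:.1f} ms")  # avg young GC pause: 23.6 ms
print(f"avg full GC pause: {avg_full_ms:.1f} ms")    # avg full GC pause: 53.4 ms
```

If I read this correctly, the old generation is only about 30% full and there have been just 82 full GCs, so the heap itself does not look exhausted at the moment this was taken, though I am not sure that rules out a leak outside the heap.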
Restarting Filebeat does not solve the problem, and SIGTERM cannot stop Logstash, so I have to use SIGKILL and then restart Logstash. After that everything works for several weeks, but it eventually goes wrong again.
Does anyone have any idea what is wrong? Is the heap size too small, or is there a memory leak somewhere?