Unacceptable Logstash startup times

There has been a disturbing trend in Logstash performance and stability. On the stability side, 6.1.3 is the last version that can run for a long time on macOS without crashing, and the 6.6.x containers crash within minutes with most of our pipelines (although some of those same pipelines run fine when installed natively on Ubuntu 18.04). Meanwhile, startup times have grown to the point of being completely unacceptable. Here are three examples from our collection, processing, and indexing instances:

Collection Pipelines

6.1.3 - 0:08 (mm:ss)

[2019-02-24T08:06:59,455][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.1.3"}
[2019-02-24T08:07:07,108][INFO ][logstash.agent           ] Pipelines running {:count=>5, :pipelines=>["collect-beats", "collect-ipfix", "collect-netflow", "collect-sflow", "collect-syslog"]}

6.6.1 - 0:26

[2019-02-24T07:54:56,510][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.6.1"}
[2019-02-24T07:55:22,494][INFO ][logstash.agent           ] Pipelines running {:count=>5, :running_pipelines=>[:"collect-syslog", :"collect-netflow", :"collect-ipfix", :"collect-beats", :"collect-sflow"], :non_running_pipelines=>[]}

Processing Pipelines

6.1.3 - 4:25

[2019-02-24T08:15:17,540][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.1.3"}
[2019-02-24T08:19:42,369][INFO ][logstash.agent           ] Pipelines running {:count=>30, :pipelines=>["process-filebeat-suricata_eve", "process-ipfix", "process-netflow", "process-sflow", "process-shared-conn", "process-syslog", "process-syslog-access", "process-syslog-apache_httpd", "process-syslog-application", "process-syslog-barracuda_cgfw", "process-syslog-blackridge_tac", "process-syslog-checkpoint", "process-syslog-cisco_ios", "process-syslog-citrix_netscaler_appfw", "process-syslog-clamav", "process-syslog-dns_logger", "process-syslog-dnsmasq", "process-syslog-forcepoint_ngfw", "process-syslog-fortinet_fortios", "process-syslog-iptables", "process-syslog-juniper_junos", "process-syslog-lastline_enterprise", "process-syslog-network", "process-syslog-nginx", "process-syslog-palo_alto", "process-syslog-squid", "process-syslog-system", "process-syslog-tripwire", "process-syslog-ulogd", "process-winlogbeat"]}

6.6.1 - 44:35 (!!!)

[2019-02-24T07:12:37,895][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.6.1"}
[2019-02-24T07:57:12,720][INFO ][logstash.agent           ] Pipelines running {:count=>30, :running_pipelines=>[:"process-sflow", :"process-syslog-nginx", :"process-syslog-juniper_junos", :"process-syslog-forcepoint_ngfw", :"process-ipfix", :"process-syslog-iptables", :"process-netflow", :"process-syslog-fortinet_fortios", :"process-syslog-blackridge_tac", :"process-syslog-system", :"process-syslog", :"process-syslog-tripwire", :"process-syslog-ulogd", :"process-syslog-cisco_ios", :"process-syslog-lastline_enterprise", :"process-winlogbeat", :"process-syslog-barracuda_cgfw", :"process-syslog-checkpoint", :"process-filebeat-suricata_eve", :"process-syslog-palo_alto", :"process-syslog-apache_httpd", :"process-shared-conn", :"process-syslog-dns_logger", :"process-syslog-access", :"process-syslog-citrix_netscaler_appfw", :"process-syslog-application", :"process-syslog-dnsmasq", :"process-syslog-network", :"process-syslog-clamav", :"process-syslog-squid"], :non_running_pipelines=>[]}

Indexing Pipelines

6.1.3 - 0:06

[2019-02-24T08:06:26,226][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.1.3"}
[2019-02-24T08:06:31,829][INFO ][logstash.agent           ] Pipelines running {:count=>30, :pipelines=>["index-filebeat-suricata_eve", "index-ipfix", "index-netflow", "index-sflow", "index-syslog-access", "index-syslog-apache_httpd", "index-syslog-application", "index-syslog-barracuda_cgfw", "index-syslog-blackridge_tac", "index-syslog-cef", "index-syslog-checkpoint", "index-syslog-cisco_ios", "index-syslog-citrix_netscaler_appfw", "index-syslog-clamav", "index-syslog-dns_logger", "index-syslog-dnsmasq", "index-syslog-forcepoint_ngfw", "index-syslog-fortinet_fortios", "index-syslog-generic", "index-syslog-iptables", "index-syslog-juniper_junos", "index-syslog-lastline_enterprise", "index-syslog-network", "index-syslog-nginx", "index-syslog-palo_alto", "index-syslog-squid", "index-syslog-system", "index-syslog-tripwire", "index-syslog-ulogd", "index-winlogbeat"]}

6.6.1 - 2:33

[2019-02-24T07:40:17,555][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.6.1"}
[2019-02-24T07:42:50,408][INFO ][logstash.agent           ] Pipelines running {:count=>30, :running_pipelines=>[:"index-syslog-cisco_ios", :"index-ipfix", :"index-syslog-ulogd", :"index-syslog-lastline_enterprise", :"index-syslog-network", :"index-syslog-forcepoint_ngfw", :"index-syslog-clamav", :"index-syslog-citrix_netscaler_appfw", :"index-syslog-squid", :"index-sflow", :"index-syslog-checkpoint", :"index-syslog-application", :"index-syslog-system", :"index-syslog-tripwire", :"index-syslog-barracuda_cgfw", :"index-syslog-dnsmasq", :"index-syslog-palo_alto", :"index-syslog-nginx", :"index-syslog-iptables", :"index-syslog-juniper_junos", :"index-syslog-cef", :"index-filebeat-suricata_eve", :"index-netflow", :"index-winlogbeat", :"index-syslog-blackridge_tac", :"index-syslog-dns_logger", :"index-syslog-generic", :"index-syslog-apache_httpd", :"index-syslog-access", :"index-syslog-fortinet_fortios"], :non_running_pipelines=>[]}
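
For context, each instance defines its pipelines with Logstash's multiple-pipelines support in pipelines.yml. A trimmed sketch (the pipeline IDs are taken from the logs above; the config paths are placeholders for illustration, not our real layout):

```yaml
- pipeline.id: collect-syslog
  path.config: "/etc/logstash/pipelines/collect-syslog/*.conf"
- pipeline.id: collect-beats
  path.config: "/etc/logstash/pipelines/collect-beats/*.conf"
# ...and so on: 5 entries on the collection instance, 30 each on the
# processing and indexing instances.
```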

For now we have been forced to standardize on Logstash 6.1.3 (with updated plugins). It would be good to be able to upgrade eventually, but it seems that no real testing is being done beyond very simple configurations.
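
(Pinning the core while keeping plugins current is straightforward with the stock plugin manager; a sketch, assuming a package install under /usr/share/logstash:)

```sh
cd /usr/share/logstash
# Update every installed plugin against the pinned 6.1.3 core.
bin/logstash-plugin update
# Or update plugins one at a time, e.g.:
bin/logstash-plugin update logstash-filter-dissect
```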

Is there any effort going into reversing this trend?

Have you checked out the known JRuby startup-time issues and the usual workarounds for them?
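
For example, JRuby's dev-mode settings are commonly suggested: they speed up startup at the expense of long-running code speed. A minimal sketch, assuming Logstash passes these standard JRuby system properties through to its embedded runtime via LS_JAVA_OPTS (this is an illustration, not a documented Logstash tuning recipe):

```sh
# Sketch only: favour startup time over long-running throughput.
# jruby.compile.mode=OFF disables the JIT compiler; turning off
# invokedynamic avoids expensive call-site setup during boot.
export LS_JAVA_OPTS="-Djruby.compile.mode=OFF -Djruby.compile.invokedynamic=false"
bin/logstash --path.settings /etc/logstash
```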

Mark, hacking stuff behind the scenes cannot be the answer, especially when some of those changes bring negative side effects like running "at the expense of long-running code speed".

There has been a consistent degradation of Logstash performance and stability since 6.1.3 that points to insufficient testing with real-world usage patterns. While we would like to stick with Logstash, if this trend doesn't change, we will be forced to move on to something else.

Slow startup is a pretty common JRuby issue, and those workarounds are what we suggest everyone look at when similar questions come up. Waving it off as "hacking stuff behind the scenes" certainly won't fix anything.

It's also very hard for us to test "real-world", because what does that even mean? We do run a suite of tests, including standardised Apache logs, but is that what you consider real world? If not, what is? Can you provide config examples like the ones above that consistently show degradation? Have you raised GitHub issues for the performance drops you are seeing?
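
If it helps standardise the numbers, startup time can be derived straight from the log, the same way the figures above appear to have been produced. A hedged sketch (assumes GNU grep/sed/date and the default plain-text log format):

```sh
# Seconds from "Starting Logstash" to "Pipelines running" in a
# logstash-plain.log; timestamps look like [2019-02-24T08:06:59,455].
t0=$(grep -m1 'Starting Logstash' logstash-plain.log | sed 's/^\[\([^]]*\)\].*/\1/')
t1=$(grep -m1 'Pipelines running' logstash-plain.log | sed 's/^\[\([^]]*\)\].*/\1/')
# GNU date wants a dot, not a comma, before the milliseconds.
echo "startup: $(( $(date -d "${t1/,/.}" +%s) - $(date -d "${t0/,/.}" +%s) ))s"
```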

We'd be more than happy to look into things further, but insinuating crappy coding practises on the part of the team, and then not providing any concrete information for us to investigate, just doesn't help anyone, unfortunately.

I created the thread here because GitHub specifically says: "Please post all product and debugging questions on our forum. Your questions will reach our wider community members there, and if we confirm that there is a bug, then we can open a new issue here." So no, I haven't raised GitHub issues... yet. However, I now have clarity on what I need to do next.

If you've got specific configs with timings, then raising those as an issue would be great :slight_smile:
