Logstash dies because of too high load

Hi,

we have the following setup:

  • Many clients with Collector Sidecar and Filebeat installed, sending their logs to Logstash
  • Some clients that send their OpenShift logs via Fluentd to Logstash
  • Two Graylog/Logstash servers that receive the logs
  • Three Elasticsearch nodes that store the logs

If the Graylog/Logstash servers are offline for some time and we fire them up again, we see ~7000 msg/s incoming in the Graylog GUI, but only for maybe one minute.
Then the message rate drops to 0, one or maybe two CPU cores on the Graylog/Logstash nodes sit at 100%, and nothing more happens.

We already tried playing with the worker/batch settings for Logstash, which had some effect: it then runs fine for around 10 minutes but dies after that.
We also tried the throttle filter, but that slows messages down to ~10/s, and after one day Logstash dies as well.
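
For reference, when the rate drops to 0 and the cores sit at 100%, a hot-threads dump from the Logstash monitoring API (port 9600 by default in 5.x) shows where it is spinning; assuming the API is enabled on the default port, something like this should work:

curl -s 'http://localhost:9600/_node/hot_threads?human=true&threads=5'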

The machines have 64 vCPU cores and ~50 GB RAM now... but nothing helps.

Here are the software versions:

logstash-5.4.1-1.noarch
graylog-server-2.2.3-1.noarch
logstash-codec-cef (4.1.2)
logstash-codec-collectd (3.0.3)
logstash-codec-dots (3.0.2)
logstash-codec-edn (3.0.2)
logstash-codec-edn_lines (3.0.2)
logstash-codec-es_bulk (3.0.3)
logstash-codec-fluent (3.1.1)
logstash-codec-graphite (3.0.2)
logstash-codec-json (3.0.2)
logstash-codec-json_lines (3.0.2)
logstash-codec-line (3.0.2)
logstash-codec-msgpack (3.0.2)
logstash-codec-multiline (3.0.3)
logstash-codec-netflow (3.4.0)
logstash-codec-plain (3.0.2)
logstash-codec-rubydebug (3.0.2)
logstash-filter-clone (3.0.2)
logstash-filter-csv (3.0.2)
logstash-filter-date (3.1.5)
logstash-filter-dissect (1.0.8)
logstash-filter-dns (3.0.3)
logstash-filter-drop (3.0.2)
logstash-filter-fingerprint (3.0.3)
logstash-filter-geoip (4.0.4)
logstash-filter-grok (3.4.0)
logstash-filter-json (3.0.2)
logstash-filter-kv (4.0.0)
logstash-filter-metrics (4.0.2)
logstash-filter-mutate (3.1.3)
logstash-filter-ruby (3.0.2)
logstash-filter-sleep (3.0.3)
logstash-filter-split (3.1.1)
logstash-filter-syslog_pri (3.0.2)
logstash-filter-throttle (4.0.1)
logstash-filter-urldecode (3.0.3)
logstash-filter-useragent (3.0.3)
logstash-filter-uuid (3.0.2)
logstash-filter-xml (4.0.2)
logstash-input-beats (3.1.12)
logstash-input-couchdb_changes (3.1.1)
logstash-input-elasticsearch (4.0.3)
logstash-input-exec (3.1.2)
logstash-input-file (4.0.0)
logstash-input-ganglia (3.1.0)
logstash-input-gelf (3.0.2)
logstash-input-generator (3.0.2)
logstash-input-graphite (3.0.2)
logstash-input-heartbeat (3.0.2)
logstash-input-http (3.0.4)
logstash-input-http_poller (3.1.1)
logstash-input-imap (3.0.2)
logstash-input-irc (3.0.2)
logstash-input-jdbc (4.2.0)
logstash-input-kafka (5.1.7)
logstash-input-log4j (3.0.5)
logstash-input-lumberjack (3.1.1)
logstash-input-pipe (3.0.2)
logstash-input-rabbitmq (5.2.3)
logstash-input-redis (3.1.2)
logstash-input-s3 (3.1.4)
logstash-input-snmptrap (3.0.2)
logstash-input-sqs (3.0.3)
logstash-input-stdin (3.2.2)
logstash-input-syslog (3.2.0)
logstash-input-tcp (4.1.0)
logstash-input-twitter (3.0.3)
logstash-input-udp (3.1.0)
logstash-input-unix (3.0.3)
logstash-input-xmpp (3.1.2)
logstash-output-cloudwatch (3.0.4)
logstash-output-csv (3.0.3)
logstash-output-elasticsearch (7.3.1)
logstash-output-file (4.0.1)
logstash-output-graphite (3.1.1)
logstash-output-http (4.2.0)
logstash-output-irc (3.0.2)
logstash-output-kafka (5.1.6)
logstash-output-nagios (3.0.2)
logstash-output-null (3.0.2)
logstash-output-pagerduty (3.0.3)
logstash-output-pipe (3.0.2)
logstash-output-rabbitmq (4.0.7)
logstash-output-redis (3.0.3)
logstash-output-s3 (4.0.7)
logstash-output-sns (4.0.3)
logstash-output-sqs (4.0.1)
logstash-output-statsd (3.1.1)
logstash-output-stdout (3.1.0)
logstash-output-tcp (4.0.0)
logstash-output-udp (3.0.2)
logstash-output-webhdfs (3.0.2)
logstash-output-xmpp (3.0.2)
logstash-patterns-core (4.1.0)

Here is the Logstash config:

output {
  if [kubernetes_host] {
    http {
      http_method => "post"
      url => "http://localhost:12201/gelf"
      codec => "json"
    }
  }
  if [type] == "syslog" {
    http {
      http_method => "post"
      url => "http://localhost:12205/gelf"
      codec => "json"
    }
  }
}


filter {

  # Rate limiting, because a restart would otherwise mean overload
  throttle {
    before_count => 3
    after_count => 5
    period => 3600
    max_age => 7200
    key => "%{host}%{message}"
    add_tag => "throttled"
  }
  if "throttled" in [tags] {
    sleep {
      time => "0.1"
    }
  }

  # Add the source file name to the message (it would otherwise be lost in Logstash)
  mutate {
    add_field => {
      "file_path" => "%{source}"
    }
  }
}
input {
  beats {
    port => "5044"
    ssl_verify_mode => "force_peer"
    ssl_certificate => "/etc/pki/tls/certs/xxxxxxx.crt"
    ssl_key => "/etc/pki/tls/certs/xxxxxxxxxx.pkcs.key"
    ssl_certificate_authorities => ["/etc/puppetlabs/puppet/ssl/certs/ca.pem"]
    ssl => "true"
  }
  gelf {
    port => "12201"
  }
  tcp {
    codec => fluent
    port => "5000"
  }
}

---
path.data: "/var/lib/logstash"
path.config: "/etc/logstash/conf.d"
path.logs: "/var/log/logstash"
pipeline:
  batch:
    size: 10
  workers: 1000
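
For comparison, the Logstash 5.x defaults are a batch size of 125 and one pipeline worker per CPU core, so a more conventional setting for this 64-core box would look roughly like the sketch below (the values are only an illustration, not something verified on this setup):

pipeline:
  batch:
    size: 125     # Logstash 5.x default
  workers: 64     # default is one worker per CPU core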

Hi,

An older version of Logstash (5.1.2-1) works without problems.
Others seem to have the same problem: https://github.com/logstash-plugins/logstash-input-lumberjack/issues/19

Can anyone guess which changes / implementations in the newer version lead to the described failures?

Thanks,
Thomas

Why do you have the number of workers set to 1000???

Hi,

with that high worker count it didn't die as fast.
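
If it helps to narrow this down, the per-plugin counters from the node stats API should show where events pile up once the pipeline stalls (assuming the monitoring API is reachable on its default port):

curl -s 'http://localhost:9600/_node/stats/pipeline?pretty'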

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.