Hi,
We have the following setup:
- Many clients with Collector Sidecar and Filebeat installed, sending their logs to Logstash (a sketch of the client-side Filebeat output config follows this list)
- Some clients that send their OpenShift logs via Fluentd to Logstash
- Two Graylog/Logstash servers that receive the logs
- Three Elasticsearch nodes that store the logs
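The Filebeat side on the clients looks roughly like this (hostnames and certificate paths below are placeholders, not our real values; the client certificate is needed because the beats input further down uses ssl_verify_mode => "force_peer"):

output.logstash:
  hosts: ["logstash01.example.org:5044", "logstash02.example.org:5044"]   # placeholder hostnames
  ssl.certificate_authorities: ["/etc/pki/tls/certs/ca.pem"]              # placeholder CA path
  ssl.certificate: "/etc/pki/tls/certs/client.crt"                        # placeholder, each client has its own cert
  ssl.key: "/etc/pki/tls/private/client.key"                              # placeholder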
If the Graylog/Logstash servers are offline for some time and we fire them up again, we see ~7,000 msg/s incoming in the Graylog GUI, but only for maybe one minute.
Then the message rate drops to 0, one or maybe two CPU cores on the Graylog/Logstash nodes sit at 100%, and nothing more happens.
We already tried playing with the workers/batch settings for Logstash; that had some effect, and it now runs fine for around 10 minutes but then dies.
We also tried the throttle filter, but that slows messages down to ~10/s, and after one day Logstash dies as well.
The machines have 64 vCPU cores and ~50 GB RAM now... but nothing helps.
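For reference, the knobs we have been playing with are in /etc/logstash/logstash.yml and look roughly like this (the exact numbers varied from attempt to attempt; the values we are currently running with are at the bottom of this post):

pipeline:
  workers: 1000    # we tried various values here
  batch:
    size: 10       # and here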
Here are the software versions:
logstash-5.4.1-1.noarch
graylog-server-2.2.3-1.noarch
logstash-codec-cef (4.1.2)
logstash-codec-collectd (3.0.3)
logstash-codec-dots (3.0.2)
logstash-codec-edn (3.0.2)
logstash-codec-edn_lines (3.0.2)
logstash-codec-es_bulk (3.0.3)
logstash-codec-fluent (3.1.1)
logstash-codec-graphite (3.0.2)
logstash-codec-json (3.0.2)
logstash-codec-json_lines (3.0.2)
logstash-codec-line (3.0.2)
logstash-codec-msgpack (3.0.2)
logstash-codec-multiline (3.0.3)
logstash-codec-netflow (3.4.0)
logstash-codec-plain (3.0.2)
logstash-codec-rubydebug (3.0.2)
logstash-filter-clone (3.0.2)
logstash-filter-csv (3.0.2)
logstash-filter-date (3.1.5)
logstash-filter-dissect (1.0.8)
logstash-filter-dns (3.0.3)
logstash-filter-drop (3.0.2)
logstash-filter-fingerprint (3.0.3)
logstash-filter-geoip (4.0.4)
logstash-filter-grok (3.4.0)
logstash-filter-json (3.0.2)
logstash-filter-kv (4.0.0)
logstash-filter-metrics (4.0.2)
logstash-filter-mutate (3.1.3)
logstash-filter-ruby (3.0.2)
logstash-filter-sleep (3.0.3)
logstash-filter-split (3.1.1)
logstash-filter-syslog_pri (3.0.2)
logstash-filter-throttle (4.0.1)
logstash-filter-urldecode (3.0.3)
logstash-filter-useragent (3.0.3)
logstash-filter-uuid (3.0.2)
logstash-filter-xml (4.0.2)
logstash-input-beats (3.1.12)
logstash-input-couchdb_changes (3.1.1)
logstash-input-elasticsearch (4.0.3)
logstash-input-exec (3.1.2)
logstash-input-file (4.0.0)
logstash-input-ganglia (3.1.0)
logstash-input-gelf (3.0.2)
logstash-input-generator (3.0.2)
logstash-input-graphite (3.0.2)
logstash-input-heartbeat (3.0.2)
logstash-input-http (3.0.4)
logstash-input-http_poller (3.1.1)
logstash-input-imap (3.0.2)
logstash-input-irc (3.0.2)
logstash-input-jdbc (4.2.0)
logstash-input-kafka (5.1.7)
logstash-input-log4j (3.0.5)
logstash-input-lumberjack (3.1.1)
logstash-input-pipe (3.0.2)
logstash-input-rabbitmq (5.2.3)
logstash-input-redis (3.1.2)
logstash-input-s3 (3.1.4)
logstash-input-snmptrap (3.0.2)
logstash-input-sqs (3.0.3)
logstash-input-stdin (3.2.2)
logstash-input-syslog (3.2.0)
logstash-input-tcp (4.1.0)
logstash-input-twitter (3.0.3)
logstash-input-udp (3.1.0)
logstash-input-unix (3.0.3)
logstash-input-xmpp (3.1.2)
logstash-output-cloudwatch (3.0.4)
logstash-output-csv (3.0.3)
logstash-output-elasticsearch (7.3.1)
logstash-output-file (4.0.1)
logstash-output-graphite (3.1.1)
logstash-output-http (4.2.0)
logstash-output-irc (3.0.2)
logstash-output-kafka (5.1.6)
logstash-output-nagios (3.0.2)
logstash-output-null (3.0.2)
logstash-output-pagerduty (3.0.3)
logstash-output-pipe (3.0.2)
logstash-output-rabbitmq (4.0.7)
logstash-output-redis (3.0.3)
logstash-output-s3 (4.0.7)
logstash-output-sns (4.0.3)
logstash-output-sqs (4.0.1)
logstash-output-statsd (3.1.1)
logstash-output-stdout (3.1.0)
logstash-output-tcp (4.0.0)
logstash-output-udp (3.0.2)
logstash-output-webhdfs (3.0.2)
logstash-output-xmpp (3.0.2)
logstash-patterns-core (4.1.0)
Here is the Logstash config:
output {
  # OpenShift/Kubernetes logs go to the GELF HTTP endpoint on port 12201
  if [kubernetes_host] {
    http {
      http_method => "post"
      url         => "http://localhost:12201/gelf"
      codec       => "json"
    }
  }
  # Syslog-type messages go to the GELF HTTP endpoint on port 12205
  if [type] == "syslog" {
    http {
      http_method => "post"
      url         => "http://localhost:12205/gelf"
      codec       => "json"
    }
  }
}
filter {
  # Rate limiting, because a restart would otherwise overload the pipeline
  throttle {
    before_count => 3
    after_count  => 5
    period       => 3600
    max_age      => 7200
    key          => "%{host}%{message}"
    add_tag      => "throttled"
  }
  if "throttled" in [tags] {
    sleep {
      time => "0.1"
    }
  }
  # Add the source file name to the message (it would otherwise be lost in Logstash)
  mutate {
    add_field => {
      "file_path" => "%{source}"
    }
  }
}
input {
  # Filebeat / Collector Sidecar clients (mutual TLS)
  beats {
    port                        => "5044"
    ssl_verify_mode             => "force_peer"
    ssl_certificate             => "/etc/pki/tls/certs/xxxxxxx.crt"
    ssl_key                     => "/etc/pki/tls/certs/xxxxxxxxxx.pkcs.key"
    ssl_certificate_authorities => ["/etc/puppetlabs/puppet/ssl/certs/ca.pem"]
    ssl                         => "true"
  }
  gelf {
    port => "12201"
  }
  # OpenShift clients sending via Fluentd
  tcp {
    codec => fluent
    port  => "5000"
  }
}
And here is the relevant part of logstash.yml:
---
path.data: "/var/lib/logstash"
path.config: "/etc/logstash/conf.d"
path.logs: "/var/log/logstash"
pipeline:
  batch:
    size: 10
  workers: 1000
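For comparison, a logstash.yml closer to the 5.x defaults would look like the sketch below; pipeline.workers defaults to the number of CPU cores (64 on these machines) and pipeline.batch.size to 125. These numbers are just our reading of the defaults, we have not verified yet whether reverting to them changes anything:

pipeline:
  batch:
    size: 125    # 5.x default batch size (assumption on our side)
  workers: 64    # default is the number of CPU cores on the host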