Hello,
We use Logstash to collect syslog messages from network devices. The instance has one pipeline listening on UDP port 7514 with 16 workers; the filter section consists of a couple of grok patterns, some mutate filters, a DNS lookup, and a Ruby script (I tried removing the Ruby script, but the problem persists). The output goes to Elasticsearch.
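Roughly, the pipeline looks like this (simplified sketch; the grok pattern, field names, and Elasticsearch host here are placeholders, not the real values):

input {
  udp {
    port    => 7514
    workers => 16        # the 16 workers mentioned above
  }
}
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }   # placeholder pattern
  }
  mutate {
    rename => { "host" => "source_host" }       # placeholder field operations
  }
  dns {
    reverse => [ "source_host" ]
    action  => "replace"
  }
  # the ruby filter sat here; removing it made no difference
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]          # placeholder host
  }
}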
Roughly once every 24 hours, Logstash stops collecting messages: the service is still running, but there are no active workers. I get the Logstash PID and check whether any worker threads are running:
# find the Logstash java process and its PID
ps aux | grep logstash-core | grep java | egrep -v "0.0 0.0"
# count non-idle worker/udp threads of that PID (55552 in this example)
top -p 55552 -Hb -n 1 | egrep -v "[A-Z]..0.0" | egrep -c "worker|udp"
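A thread dump is another way to check the same thing; in Logstash 7.x the pipeline worker threads are named like "[main]>worker0" (assuming the default pipeline id "main"; 55552 is again the Logstash PID):

jstack 55552 | grep '>worker'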
I have also tried diagnosing with the monitoring REST API, but the status is always green and the reported number of workers always matches the configured value (the calls I mean are shown after the server specs below). The logs show no errors either. All of this is happening on the following Logstash server:
CentOS 7
CPU: 4 x Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz
RAM: 8G
DISK: 20G
Logstash version: 7.10.0
Java:
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
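These are the kinds of API calls I mean (assuming the default API port 9600):

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'
curl -s 'http://localhost:9600/_node/hot_threads?pretty'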
more /etc/logstash/logstash.yml
pipeline.batch.size: 250
pipeline.batch.delay: 50
pipeline.unsafe_shutdown: true
pipeline.workers: 64
path.data: /var/lib/logstash
config.reload.automatic: true
config.reload.interval: 10s
path.logs: /var/log/logstash
What makes it even more counter-intuitive is that the outages happen during quiet periods, when the devices are outputting less data.
Does anyone have an idea how to fix this problem, or any suggestions for further diagnostics?