Hello everyone!
I have a problem with Filebeat. I collect logs from 2 files and send them to Logstash. The event rate decreases after a few hours of work and becomes normal again after restarting Filebeat or after the connection to Logstash is reset. The normal event rate is 250 e/s; when stalled it is ~50 e/s.
In addition, it's strange: earlier Filebeat never worked properly for more than 6 hours, but now it has been running for more than 12 hours since the last incident. I'll try to reproduce this behavior and capture syscalls with strace. I hope it will be useful.
Hello,
Debug logs are here. They cover a much longer period than needed; the interesting part is from 2020-08-03 09:30 to 2020-08-03 11:09. I can remove the irrelevant lines, let me know if you want that.
Also, the log file is filtered with egrep -h '^[0-9]+.*[A-Z]{4,}' filebeat{.1,} | grep -v 'Publish event'. I think you don't need the individual events in this log file.
I don't see anything especially bad, but something caught my attention. Starting at ~9:30, there are many batches sent with 2048 events:
2020-03-08T09:30:05.370Z DEBUG [logstash] logstash/async.go:159 2048 events out of 2048 events sent to logstash host elk.local:5043. Continue sending
2048 is the default maximum batch size. Is it possible that starting at 9:30 your nginx server is generating more logs? It could be that something in the pipeline cannot cope with this load and gets saturated, affecting performance.
If you are also monitoring Logstash, please check its metrics to see if it could be constrained by some resource. You may also need to increase the number of workers or the max batch size in Logstash.
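For reference, these are the relevant settings in logstash.yml; the values below are only examples to experiment with, not a recommendation:

# logstash.yml
# Number of pipeline worker threads (defaults to the number of CPU cores)
pipeline.workers: 8
# Maximum number of events a worker collects before running filters/outputs
pipeline.batch.size: 250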
If you are monitoring Elasticsearch, it could also be interesting to see the ingest rate during the periods when Filebeat is having low performance.
Yes, it is. Sometimes (every 10 minutes) we have request rate spikes (1500-4000 RPS). But performance is not affected every time. Before using Filebeat we ran some tests and got more than 6k events/sec with the default configuration.
I'm still trying to resolve the issue without restarting Beats every hour, and I found another curious thing. This command shows events sent and events acked by Logstash:
# Issue started here
Sent: 2048/2048
Sent: 1806/1806
Sent: 242/242
Acked: 2048
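For reference, the Sent lines above can be reproduced from the debug log with something like the following; the Acked lines are extracted the same way from the corresponding ack messages:

# Roughly how the Sent: N/M lines are derived from the filebeat debug log
grep -h 'events out of' filebeat{.1,} | sed -E 's|.* ([0-9]+) events out of ([0-9]+) events.*|Sent: \1/\2|'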
Yes, this is why I suggested trying to increase bulk_max_size, which could be involved here and suspiciously defaults to 2048. Did you see any difference in these values after increasing bulk_max_size?
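For reference, the setting belongs in the output.logstash section of filebeat.yml; the host below is taken from your logs and the value is only an example to test with:

output.logstash:
  hosts: ["elk.local:5043"]
  # Defaults to 2048; raising it is an experiment, not a recommendation
  bulk_max_size: 4096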
There is a possible bug in the logstash output that can make a beat continuously retry sending the same batch of events if any of the events is rejected. I created an issue for that some time ago, but we are not sure of the conditions under which it happens: https://github.com/elastic/beats/issues/11732. In any case, if this is the issue, we should see something about failed events in the Filebeat or Logstash logs.
Would it be an option for you to try sending events directly from Filebeat to Elasticsearch, without Logstash? That way we can confirm whether the issue is in Filebeat, in Logstash, or in the logstash output.
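For the test it would be something like this in filebeat.yml (the Elasticsearch host is a placeholder, point it at your own cluster):

#output.logstash:
#  hosts: ["elk.local:5043"]
output.elasticsearch:
  hosts: ["http://localhost:9200"]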
No changes. I tried increasing bulk_max_size to 4096 and saw no difference.
Unfortunately I cannot do that, because I'm using ClickHouse instead of ES.
But I found a curious thing again. I disabled compression in Filebeat (I was trying to capture traffic) on one of my servers, and the problem is gone. 4 days without problems.
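In case it helps, this is roughly the change I made in filebeat.yml:

output.logstash:
  hosts: ["elk.local:5043"]
  # compression_level defaults to 3; 0 disables compression entirely
  compression_level: 0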