I don't see anything special in your configs. The connection between filebeat and logstash uses TCP. In addition, filebeat waits for an ACK from logstash before updating its registry and freeing the resources of published log messages. If no ACK is received, filebeat has to assume a network/logstash failure and resend the events. Any system using TCP for pushing messages/events is subject to back-pressure.
On top of TCP and ACKs, filebeat also applies some 'windowing', which starts at 10 events and grows exponentially up to `bulk_max_size`. If logstash is overloaded and we see errors/disconnects, the window can shrink and at some point even get stuck at a not-so-optimal value. This is to ensure beats don't break/kill older versions of Logstash, which don't send a health-ping. Unfortunately, the window size can only be inspected via debug logs.
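As a minimal sketch, the relevant filebeat settings might look like this (host name and sizes are just placeholders for the example):

```yaml
# filebeat.yml -- illustrative values, adjust to your setup
output.logstash:
  hosts: ["logstash:5044"]
  bulk_max_size: 2048    # upper bound for the dynamic batch window

# debug logging lets you watch the window size grow/shrink
logging.level: debug
logging.selectors: ["logstash"]
```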
Logstash internally uses queues and pushes data to Elasticsearch (or other services), mostly via TCP and request-response style protocols (HTTP). That is, the outputs, the network, and the Logstash filters can all contribute to the overall back-pressure experienced by filebeat. Sometimes it's a 'bad' log message grok has to grind upon (consider the dissect filter if you find grok to be slow).
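For illustration, a dissect-based filter for a simple space-delimited log line could look like this (the field names and layout are made up for the example):

```
filter {
  dissect {
    # split e.g. "2017-08-01T12:00:00Z INFO something happened"
    # into ts, level and msg without any regex matching
    mapping => {
      "message" => "%{ts} %{level} %{msg}"
    }
  }
}
```

Unlike grok, dissect splits on fixed delimiters instead of evaluating regular expressions, which is why it tends to be cheaper on pathological input.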
Networking can play a big role during peak times, as bandwidth might be shared with other applications' bandwidth requirements (plus QoS settings?), potentially affecting the throughput of the TCP connections. Add things like bufferbloat in network devices/OS and buffered TCP segments being resent: the increased latencies can negate TCP congestion control and further hurt throughput.
A simple technique to relieve a system from temporary overload/back-pressure is buffering in intermediate queues. That's one use-case for the persistent queue in Logstash, as the queue can accept/ACK events even if filters/outputs can not keep up. Still, when using queues, they should operate in an almost-empty state most of the time. If you find your queues to be full for much too long, the buffering effect is mostly neutralised for no good.
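A hedged example of enabling the persistent queue (the size and path are placeholders you would tune for your disk budget):

```yaml
# logstash.yml -- sketch of persistent-queue settings
queue.type: persisted
queue.max_bytes: 4gb                  # on-disk buffer; events are ACKed once queued
path.queue: /var/lib/logstash/queue   # example path, pick a fast disk
```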
You can test filebeat with the file or console output to get an idea how fast filebeat can process your logs. Network outputs normally add some more overhead (e.g. compression for the LS output, network, decompression, decoding in LS, ...), giving you some more back-pressure in filebeat.
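Such a local throughput test might be configured like this (the path is just an example; remember to disable the logstash output while testing, since only one output can be active):

```yaml
# filebeat.yml -- benchmark local processing speed without the network
output.file:
  path: "/tmp/filebeat"      # example directory
  filename: filebeat-out
```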
In 6.0beta1 we introduce asynchronous sending of events + pipelining of batches by default. Pipelining can reduce network/encoding/decoding latencies, potentially increasing throughput.
How many logstash instances are you running in your system? If you already have multiple instances, you might consider one of:
- use load-balancing from all beats to all logstash instances
- configure all logstash instances in all beats without load balancing.
In the latter case, filebeat will connect to one logstash instance at random. On I/O error, filebeat will reconnect to another Logstash instance at random. But if filebeat is only slowed down (without an I/O error), it will not try to reconnect.
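The load-balancing variant is configured per beat, roughly like this (host names are placeholders):

```yaml
# filebeat.yml -- distribute batches across several Logstash instances
output.logstash:
  hosts: ["ls1:5044", "ls2:5044", "ls3:5044"]
  loadbalance: true    # without this, one host is picked at random
```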