Rate limiting between Filebeat and Logstash?


(Draven Johnson) #1

Is there any rate limiting between filebeat and logstash?
Or some kind of back-pressure setting between filebeat and logstash?

We have ~2K servers running filebeat and sending logs to 12 logstash instances using the beats plugin.
At some point during the day, a few hosts (usually ~15, now ~40) slow down their data transport and start lagging in reporting to logstash.

It seems to us that either logstash or the host running filebeat is overloaded, or network bandwidth is limiting this. Restarting filebeat, which lets it pick up another logstash host, usually fixes the problem.

any idea?


#2

Hi,

I can only confirm part of your question. Back pressure exists between logstash and filebeat. Please refer to this link: https://www.elastic.co/guide/en/logstash/current/persistent-queues.html#backpressure-persistent-queue

Hope it helps


(Steffen Siering) #3

Can you share filebeat configuration and logs?


(Draven Johnson) #4

We don't have logging enabled for filebeat, sorry.

It's really hard to reproduce this situation, since you need a certain number of filebeat instances / servers to see it. It must have something to do with back pressure.


(Draven Johnson) #5

Any update on this?


(Steffen Siering) #6

I don't see anything special in your configs. The connection between filebeat -> logstash uses TCP. In addition, filebeat waits for an ACK from logstash before it updates the registry and frees the resources of published log messages. If no ACK is received, filebeat has to assume a network/logstash failure and resend the events. Any system using TCP for pushing messages/events is subject to back-pressure.
On top of TCP and ACKs, filebeat also applies some 'windowing', which starts at 10 events and grows exponentially up to bulk_max_size. If logstash is overloaded and we see errors/disconnects, the window can shrink and at some point even get stuck at a less-than-optimal value. This is to ensure beats don't break/kill older versions of Logstash that do not send a health ping. The window size can only be inspected via debug logs (-d 'logstash').
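To make the windowing idea more concrete, here is a rough Python sketch. It is not filebeat's actual code: the doubling on success and halving on failure are assumptions made for illustration (the description above only says the window grows exponentially and can shrink), and the 2048 used for bulk_max_size is likewise just an assumed value.

```python
def simulate_window(outcomes, start=10, bulk_max_size=2048):
    """Return the window size used for each batch.

    outcomes: iterable of booleans, True = batch ACKed, False = error/disconnect.
    Assumption: the window doubles after a success and halves after a failure.
    """
    window = start
    sizes = []
    for ok in outcomes:
        sizes.append(window)
        if ok:
            # Grow exponentially, but never beyond bulk_max_size.
            window = min(window * 2, bulk_max_size)
        else:
            # Shrink on errors, but keep at least one event per batch.
            window = max(window // 2, 1)
    return sizes


if __name__ == "__main__":
    # Nine ACKed batches followed by two failures.
    print(simulate_window([True] * 9 + [False, False]))
```

With repeated errors the simulated window stays small, which is the kind of stall at a poor value mentioned above.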

Logstash internally uses queues and pushes data to Elasticsearch (or other services), mostly via TCP and request-response-like protocols (HTTP). That is, the outputs, the network and the Logstash filters can all contribute to the overall back-pressure experienced by filebeat. Sometimes it's a 'bad' log message that grok has to grind on (consider the dissect filter if you find grok to be slow).

Networking can play a big role during peak times, as bandwidth might be shared with other applications' bandwidth requirements (plus QoS settings?), potentially affecting the throughput of the TCP connections. Add things like bufferbloat in network devices/OS: buffered TCP segments being resent and increased latencies negate TCP congestion control, which can negatively impact throughput.

A simple technique to relieve a system of temporary overload/back-pressure is buffering in intermediate queues. That's one use case for the persistent queue in Logstash, as the queue can accept/ACK events even if filters/outputs cannot keep up. Still, when using queues, they should operate in an almost-empty state most of the time. If you find your queues to be full for much too long, the buffering effect is mostly neutralised.

You can test filebeat->file/console output to get an idea of how fast filebeat can process your logs. Normally network outputs add some more overhead (e.g. compression for the LS output, network, decompression, decoding in LS, ...), which gets you some more back-pressure in filebeat.

In 6.0beta1 we introduce asynchronous sending of events + pipelining of batches by default. Pipelining can reduce network/encoding/decoding latencies, potentially increasing throughput.

How many logstash instances are you running in your system? If you already have multiple instances, you might consider one of these:

  • use load-balancing of all beats to all logstash instances
  • configure all logstash instances in all beats without load balancing.

In the latter case, filebeat will connect to one logstash instance at random. On an I/O error, filebeat will reconnect to another Logstash instance at random. But if filebeat is only slowed down (without an I/O error), it will not try to reconnect.


(Draven Johnson) #7

Thank you so much for this awesome answer.

We have ~12 instances. We had some bad experiences with ELB (AWS's load balancer), so we stopped using it. (It had the same problem we see now, so ELB won't help.)

Right now we are trying to use DNS round-robin.

It might be a network problem that causes this, but we want to know which part, why, and how to fix it.
The most lagging hosts (there are ~20 each day now, sometimes the same ones, sometimes different) lag for ~2-10 hours, and they are mostly hosts in far regions such as Japan, India etc. (our ELK is on AWS in the US-East region). On these hosts, filebeat is still sending logs, just at a really slow speed, and a lot of files are being held back. Restarting filebeat usually lets it pick up another logstash instance, and it then catches up on all logs in ~10 min.

We will do more testing on this now, but we also want to understand more about how filebeat -> logstash limits the speed / rate based on load and network performance.

Thanks again.


(Steffen Siering) #8

Well, sending logs from Asia to the US is quite a distance. Plus, I think AWS applies TCP rate limiting as well.

Some more input on deployment patterns also check:


(Draven Johnson) #9

Thank you so much for this great answer.

For now, we are doing a bunch of investigation, and it seems the problem does exist; we're pretty sure it's some kind of rate limiting issue.

I am wondering whether the script used to get the filebeat rate no longer works? (From the topic "Filebeat sending data to Logstash seems too slow".)
Is there any other way to test the filebeat and logstash rate?


(Steffen Siering) #10

I haven't used the script in a while, but it should still work. Well, as we have many more variables now, the script might break if the terminal is not big enough...

filebeat reports non-zero metrics to its logs about every 30 seconds. Actually, it reports the delta. If you look for acked_events and divide it by 30s, you get the rate. In addition, logstash also collects internal metrics, which can already be used with x-pack monitoring.
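As a small illustration of that calculation, here is a Python sketch that pulls the acked_events delta out of a metrics line and divides it by the reporting interval. The exact log line format is an assumption (the sample line below is hypothetical); the only thing the sketch relies on is a key ending in acked_events followed by = and a number. If a line contains several such keys, only the first one is used.

```python
import re

# Assumption: the metric shows up as "<prefix>acked_events=<number>" in the log line.
ACKED_RE = re.compile(r"acked_events=(\d+)")


def acked_rate(log_line, interval_s=30.0):
    """Return events/second from one metrics line, or None if no counter is found."""
    match = ACKED_RE.search(log_line)
    if match is None:
        return None
    # The logged value is a delta over the interval, so dividing gives the rate.
    return int(match.group(1)) / interval_s


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = "INFO Non-zero metrics in the last 30s: libbeat.logstash.published_and_acked_events=6000"
    print(acked_rate(sample))
```

Feeding it the metrics lines from a lagging host and from a healthy host gives a quick side-by-side comparison of their rates.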

For testing filebeat->logstash throughput without filters one can use this logstash configuration:

input {
  beats { ... }
}

output {
  stdout { codec => "dots" }
}

This prints a . per event processed by logstash. Running logstash with logstash -f test.conf | pv -War >/dev/null, you can see the current and average event rate in your terminal. As this test removes filters, outputs and any other additional source of back-pressure, it gets you a quite good base number for the event rates you can actually send to logstash in your environment. This way you can also see how additional filebeat instances might affect overall throughput (which should not scale linearly, due to additional contention).

