Filebeat flooding logstash

(Tim Desrochers) #1

Filebeat and logstash 6.2

I have 18k+ devices with filebeat sending to logstash. My logstash is in AWS and is fronted by a classic ELB. I terminate ssl on the ELB and forward to the correct port for logstash.

I had an incident where my logstash nodes were unhealthy and the port became unavailable. I have 8 logstash nodes and eventually they all became unavailable because they were being overworked. This caused OOM errors and made it so the nodes couldn't process any more. The issue is that the filebeat agents were still trying to connect.

On the client side we saw filebeat connect to the ELB and ssl was terminated, but logstash closed the connection. Immediately afterwards filebeat tried again. This happened many times per second per host, and with 18k hosts doing this it used a lot of network bandwidth.

How can I configure filebeat to back off exponentially when its connection is gracefully closed, but this happens many times per second?

I understand that filebeat tries to reopen the connection so it can optimize latency, but the issue here is that when logstash was closing the connections because it couldn't handle the load, the agents DDoS'd logstash.

Are there any best practices that can be used to rate-limit filebeat? For this type of scenario, what could be done? If I were to move the ssl handoff to logstash instead of the ELB, would that change the behavior of filebeat?

Thank you. I'm trying to take lessons learned from this incident to build a more reliable and resilient pipeline.

(Pier-Hugues Pellerin) #2

@tgdesrochers Filebeat uses an exponential backoff to reconnect when an error occurs. I think the problem in this case is that each of the 18K hosts is retrying to send a full batch, and this is causing the OOM. I think configuring filebeat with slow_start would help if a cascading error happens.
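
As a rough illustration of what exponential backoff with jitter looks like in general, here is a minimal Python sketch. It is not Filebeat's actual implementation: the initial delay, the cap, the growth factor, and the helper names (`backoff_delays`, `with_jitter`, `attempts_in_window`) are all assumptions chosen for the example. Only the 18k host count comes from this thread.

```python
import random


def backoff_delays(initial=1.0, maximum=60.0, factor=2.0, attempts=8):
    """Return the nominal wait (in seconds) before each reconnect attempt.

    All default values are illustrative assumptions, not Filebeat settings.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, maximum)
    return delays


def with_jitter(delay, rng=random):
    """Pick a random wait in [delay / 2, delay] so hosts do not retry in lockstep."""
    return rng.uniform(delay / 2, delay)


def attempts_in_window(delays, window):
    """Count how many reconnect attempts fit into `window` seconds."""
    elapsed = 0.0
    count = 0
    for delay in delays:
        elapsed += delay
        if elapsed > window:
            break
        count += 1
    return count


if __name__ == "__main__":
    delays = backoff_delays()
    hosts = 18_000  # host count mentioned in this thread
    per_host = attempts_in_window(delays, window=60.0)
    print("nominal delays:", delays)
    print("attempts per host in the first minute:", per_host)
    print("attempts across all hosts in the first minute:", hosts * per_host)
```

With these assumed values a single host makes only a handful of attempts in the first minute instead of many per second, and the jitter spreads those attempts out so that thousands of hosts do not all reconnect at the same instant.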

We plan to refactor the exponential backoff strategy. I will check if we can expose some knobs to users so you can tweak the behavior.
