Filebeat and logstash 6.2
I have 18k+ devices running Filebeat and sending to Logstash. Logstash runs in AWS behind a classic ELB. I terminate SSL on the ELB and forward to the Logstash port.
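For context, this is roughly what the output section of filebeat.yml looks like on the agents (the ELB hostname and CA path are placeholders):

```yaml
# filebeat.yml (agent side) - illustrative only, hostname and paths are placeholders
output.logstash:
  # Classic ELB sitting in front of the Logstash nodes; TLS terminates on the ELB
  hosts: ["logstash-elb.example.com:5044"]
  ssl.enabled: true
  # CA used to verify the certificate presented by the ELB
  ssl.certificate_authorities: ["/etc/filebeat/ca.pem"]
```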
I had an incident where my Logstash nodes became unhealthy and the port became unavailable. I have 8 Logstash nodes, and eventually all of them went down because they were overloaded. This caused OOM errors and left the nodes unable to process anything. The issue is that the Filebeat agents kept trying to connect.
On the client side we saw Filebeat connect to the ELB, SSL was terminated, but Logstash closed the connection. Filebeat immediately retried. This happened many times per second per host, and with 18k hosts doing this it consumed a lot of network bandwidth.
How can I configure Filebeat to back off exponentially when its connection is gracefully closed but this happens many times per second?
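For example, I was hoping for something along these lines in filebeat.yml, but I have not verified that these backoff settings exist for the Logstash output in 6.2, and I don't know whether they apply when the connection is accepted and then closed gracefully rather than refused:

```yaml
# filebeat.yml - illustrative sketch, not verified against 6.2
output.logstash:
  hosts: ["logstash-elb.example.com:5044"]
  ssl.enabled: true
  # Hoped-for behavior: wait 1s before reconnecting, backing off exponentially up to 60s
  backoff.init: 1s
  backoff.max: 60s
```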
I understand that Filebeat reopens the connection immediately to optimize latency, but the issue here is that when Logstash was closing connections because it couldn't handle the load, the agents effectively DDoS'd Logstash.
Are there any best practices for rate limiting Filebeat? What could be done for this type of scenario? If I moved the SSL handoff from the ELB to Logstash, would that change Filebeat's behavior?
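If it matters, moving the SSL handoff would mean a plain TCP listener on the ELB and something like this on the Logstash side (certificate paths are placeholders):

```
# Logstash pipeline input with TLS terminated on Logstash itself - illustrative sketch
input {
  beats {
    port => 5044
    ssl  => true
    # Placeholder paths; each node would need its own cert/key
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key         => "/etc/logstash/certs/logstash.key"
  }
}
```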
Thank you. I'm trying to take the lessons learned from this incident and build a more reliable and resilient pipeline.