The issue only occurs on our "slower" servers, i.e. those with low filebeat volume. All our servers are configured to load balance across 4 logstash servers. For our busy servers, this works fine - they maintain (and use) all 4 connections. Some of the slow servers (those without as much filebeat traffic) drop down to just one connection to a single logstash server. Here is what seems to happen (this isn't the bug, this seems to be expected behavior - let me know if I'm wrong):
- Logstash resets (drops) connections from filebeat servers that have been idle for a while ("write: connection reset by peer" in the filebeat log).
- Because of the low traffic volume, filebeat doesn't bother reconnecting, since it has other connections and not much data to send (again, this is the behavior I've observed, so I'm assuming it is expected - let me know if I'm wrong).
- If the volume is very low, all 4 connections will eventually get dropped. In this case, filebeat just chills until it has more data to send, then picks one logstash server to connect to and sends the data (it may actually re-establish all 4 connections - I wasn't paying attention).
- If the volume is low, but still high enough that the last remaining connection never sits idle long enough to be dropped, then filebeat will go along merrily with that one connection, logging only to that one logstash server (since this logic is for load balancing, not redundancy, this seems reasonable).
The problem occurs at this point in the scenario: if that one logstash server disappears, filebeat has no other open connections to round-robin to, and does not try to establish them - it just waits for that one logstash server. I've seen it wait up to 1 hour (and then resume sending to that logstash server when it comes back online).
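To make the sequence above easier to reason about, here is a toy Python model of the behavior as I've observed it. It is only a sketch and is not filebeat's actual implementation: the class name `ToyBalancer`, the `idle_timeout` value, and the assumption that a worker stuck retrying a dead host is never reaped as idle are all hypothetical.

```python
class ToyBalancer:
    """Toy model of the OBSERVED behavior described above, not filebeat's real code."""

    def __init__(self, hosts, idle_timeout=60):
        self.hosts = list(hosts)
        self.idle_timeout = idle_timeout              # hypothetical idle cutoff, in seconds
        self.last_used = {h: 0 for h in self.hosts}   # open connections -> time of last send
        self.down = set()                             # hosts that are currently unreachable
        self.turn = 0                                 # round-robin counter

    def expire_idle(self, now):
        # The server side resets connections that have been idle for too long.
        for host, last in list(self.last_used.items()):
            if now - last > self.idle_timeout:
                del self.last_used[host]

    def send(self, now):
        """Return the host that received the event, or None if the sender is stuck."""
        self.expire_idle(now)
        if not self.last_used:
            # Every connection was dropped: open a single new one.
            up = [h for h in self.hosts if h not in self.down]
            if not up:
                return None
            self.last_used[up[0]] = now
            return up[0]
        # Round-robin over the connections that are still open only;
        # dropped connections are never re-established here.
        open_hosts = sorted(self.last_used)
        host = open_hosts[self.turn % len(open_hosts)]
        self.turn += 1
        if host in self.down:
            # Assumption: the worker keeps retrying this host, so it is not
            # reaped as idle, and no other host is ever tried.
            self.last_used[host] = now
            return None
        self.last_used[host] = now
        return host


if __name__ == "__main__":
    lb = ToyBalancer(["logstash1:5046", "logstash2:5046",
                      "logstash3:5046", "logstash4:5046"])
    for t in (50, 100, 150):          # one event every 50 seconds
        print(t, lb.send(t))
    lb.down.add("logstash1:5046")     # the last remaining host disappears
    print(200, lb.send(200))
    lb.down.clear()                   # ...and later comes back
    print(250, lb.send(250))
```

With one event every 50 seconds and a 60 second idle cutoff, each of the four connections would only be used every 200 seconds, so three of them are dropped and the model collapses onto a single host. Once that host is marked down, `send` returns `None` until the host returns, which matches the stall I'm seeing.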
filebeat version 6.8.3 (amd64), libbeat 6.8.3
Logstash version 6.8.3
output.logstash:
  enabled: true
  hosts: ["logstash1:5046", "logstash2:5046", "logstash3:5046", "logstash4:5046"]
  loadbalance: true
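For anyone who wants to check how many logstash connections a given filebeat server is actually holding, here is a small Python sketch that counts established TCP peers on port 5046 from `ss -tn` style output. The function name `peers_on_port` and the sample addresses are made up for illustration, and the parser assumes the usual column order (State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port).

```python
def peers_on_port(ss_output, port):
    """Return the set of peer addresses with an ESTAB connection to `port`."""
    peers = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not an established connection.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        # The fifth column is "peer_address:peer_port"; split on the last colon.
        addr, _, peer_port = fields[4].rpartition(":")
        if peer_port == str(port):
            peers.add(addr)
    return peers


# Hypothetical sample, shaped like `ss -tn` output (addresses are invented).
SAMPLE = """\
State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.5:48210      10.0.0.11:5046
ESTAB  0      0      10.0.0.5:22         10.0.0.99:51514
"""

if __name__ == "__main__":
    print(sorted(peers_on_port(SAMPLE, 5046)))
```

On a real server the input could come from something like `subprocess.run(["ss", "-tn"], capture_output=True, text=True).stdout`. A busy server should show 4 peers, while one of the affected slow servers would show just 1.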
(Sorry for the repost - this seems like a bug and I wanted to get some kind of feedback before I submitted it as a bug. I also rewrote the text and added more details.)