Our config is:
filebeats -> Kafka
Approx 1000 AWS instances, mostly running Rails, logging approx 10 million lines per day into a 3-node Kafka / 3-node ZooKeeper cluster behind an ELB.
A few days ago, we lost Kafka in the middle of the night due to a disk space issue. Beats could not deliver logs, so each node kept the last successfully delivered position in its registry, as it should.
In the morning Kafka was rebuilt with more space and a more conservative retention period. By this time, however, a significant backlog of undelivered logs had accumulated on the 1000 nodes.
When the Kafka cluster came back online, traffic to each Kafka node jumped from the normal 7 million bytes per minute to 700 million bytes per minute (a 100x increase), where it remained flatlined (maxed out) for several hours. During this time, Kibana gained only about 1 hour of logs, about 40 million lines, most of which were duplicates.
I reconfigured Beats to set tail_files, removed the registry during the Beats restart, and everything returned to normal (minus that night's logs).
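For reference, the workaround was roughly the following prospector setting (the log paths here are illustrative, and the registry file location varies by install):

```yaml
filebeat:
  prospectors:
    - paths:
        - /var/log/rails/*.log
      # Start reading at the end of each file instead of replaying the backlog
      tail_files: true
```

Combined with stopping Filebeat, deleting the registry file, and restarting, this discards the accumulated backlog and resumes from the current end of each file.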
It appears that Filebeat may be too aggressive in trying to recover from this out-of-sync condition.
1000 nodes trying to deliver a backlog of logs overwhelmed the endpoint. This is not necessarily specific to Kafka; the same would likely happen with Redis, Logstash, or Elasticsearch.
Coincidentally, another anomaly occurs in which duplicate harvesters are opened for the same log file, which may exacerbate the issue: the registry entry written by one harvester is overwritten with an old offset by another. This can cause the same log lines to be delivered to the endpoint almost endlessly (although in our case it seemed to stop at 2.4 million identical lines).
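The suspected race can be sketched as a blind last-write-wins registry update; this is an illustrative reconstruction, not actual Filebeat code:

```python
# Two harvesters for the same file each track their own in-memory offset
# and periodically flush it to a shared registry. The flush is blind
# last-write-wins, so a stale duplicate harvester can rewind the
# recorded offset of the one that is actually ahead.

registry = {}  # file path -> last recorded byte offset

def flush(path, offset):
    registry[path] = offset  # no check that offset only moves forward

# Harvester A has read to byte 5000; a duplicate harvester B, opened
# later on the same file, has only read to byte 1200.
flush("/var/log/app.log", 5000)   # A records real progress
flush("/var/log/app.log", 1200)   # B clobbers it with a stale offset

# On the next resume, delivery restarts from 1200 and bytes 1200-5000
# are sent to the endpoint again.
```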
Filebeat becomes a highly efficient Gatling gun, hammering the endpoint with log lines, some (or most?) of them duplicates.
Given that the Kafka ack returns no information about the state of the cluster, it may be necessary to give users a parameter to throttle Beats so that a runaway recovery can be managed, perhaps a maximum number of log lines per minute. Even if Beats hammered for 10 seconds to deliver the maximum and then stopped until the next minute, that would be better than relentless delivery of log lines.
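The proposed "maximum log lines per minute" parameter amounts to a fixed-window throttle: burn through the window's budget as fast as you like, then go quiet until the next window. A minimal sketch (the class and parameter names are invented for illustration, not actual Beats code):

```python
import time

class LineRateLimiter:
    """Fixed-window throttle: allow at most max_lines_per_minute sends,
    then refuse until the next one-minute window begins."""

    def __init__(self, max_lines_per_minute, clock=time.monotonic):
        self.max = max_lines_per_minute
        self.budget = max_lines_per_minute
        self.clock = clock
        self.window_start = clock()

    def try_send(self, line, send):
        now = self.clock()
        if now - self.window_start >= 60:
            # New window: reset the budget.
            self.window_start = now
            self.budget = self.max
        if self.budget > 0:
            self.budget -= 1
            send(line)
            return True
        return False  # budget exhausted; caller waits for the next window
```

This matches the behaviour described above: the sender may hammer for a few seconds to exhaust the budget, then stops, which bounds recovery traffic per node regardless of backlog size.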
Eventually, it may be worthwhile to investigate intelligent auto-throttling based on the response time of the endpoint.
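One plausible shape for such auto-throttling is additive-increase/multiplicative-decrease keyed on ack latency, as TCP does for congestion: ramp the send rate slowly while the endpoint responds quickly, cut it sharply when latency climbs. The target latency and gain constants below are assumptions for illustration only:

```python
class LatencyThrottle:
    """AIMD rate controller driven by endpoint ack latency.
    Illustrative sketch, not actual Beats behaviour."""

    def __init__(self, target_latency_s=0.5, min_rate=100.0, max_rate=100_000.0):
        self.target = target_latency_s    # latency we consider "healthy"
        self.min_rate = min_rate          # floor, lines per second
        self.max_rate = max_rate          # ceiling, lines per second
        self.rate = min_rate              # start conservatively

    def observe(self, ack_latency_s):
        if ack_latency_s <= self.target:
            # Endpoint is keeping up: probe upward gently.
            self.rate = min(self.rate + 100.0, self.max_rate)
        else:
            # Endpoint is struggling: back off hard.
            self.rate = max(self.rate * 0.5, self.min_rate)
        return self.rate
```

With 1000 nodes each backing off on slow acks, recovery traffic would settle near what the endpoint can absorb instead of flatlining it for hours.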