Given your Reproduction steps and actual result, I'd say Filebeat is behaving correctly. It does not have a problem sending data; it detects that one of the brokers is not reachable and therefore blocks until your cluster is stable again.
With Kafka, lots of logic is pushed into the clients, especially the producers. In Kafka one creates a topic with N partitions. These partitions control the amount of load balancing you have in the system: with N partitions, the data should be distributed roughly equally among them. Partitions are stored on disk (and partitions in turn are split into segment files). Sizing and retention policies must take into account the amount of data to be stored in a partition. The default retention policy is by time, not by size. A common issue people run into: the disk fills up.
Partitions are not only about scaling disk usage. Consumers read from partitions. A consumer in a consumer group can read from one or more partitions, but a partition is only read by one consumer within the group. That is, the number of partitions also controls the amount of scaling you can have downstream. Proper load balancing in the consumers depends on the producers properly load balancing events among the partitions.
Producers/consumers do not operate against single brokers, but against the cluster. When connecting a producer/consumer to a cluster, the client starts by bootstrapping its broker connections. During bootstrap, one broker from the list of configured brokers is used to query the cluster metadata. The metadata contains the cluster state and all brokers plus their advertised hostnames. A client creates connections to the cluster brokers based on this metadata, not based on the initial configuration.
- Filebeat is configured to ship data to Kafka brokers
- there are three Kafka brokers defined in Filebeat configuration
The configured hosts are only used for the connection bootstrap. The Kafka cluster metadata determines where events are actually sent.
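A minimal filebeat.yml sketch to illustrate (broker hostnames and topic name are placeholders):

```yaml
output.kafka:
  # Bootstrap list only: one of these brokers is used to fetch the cluster
  # metadata. The brokers events are actually sent to are taken from that
  # metadata (advertised hostnames), not from this list.
  hosts: ["broker1:9092", "broker2:9092", "broker3:9092"]
  topic: "filebeat"
```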
- Kafka brokers are synchronizing data between themselves - so it only needs to reach one of the brokers, no need to ship to all of them
State in Kafka is very localized. Beats (or any other client) do not publish events to brokers; they publish events to partitions. Brokers act as 'leader' and 'followers' for partitions. A client must send data to the leader broker in order to publish an event to the partition. The leader forwards events to the replicas (the follower brokers). A replica can be in sync or out of sync. If the leader broker goes down, the partition can be served by another broker if and only if there is another in-sync replica broker holding the same partition. If not, the partition is unavailable. Having replicas is not just about data safety, it is also a way of failing over in case of single-broker outages.
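A related knob, not discussed above, is the Filebeat Kafka output's `required_acks` setting, which controls how many of these replicas must acknowledge an event; a hedged sketch using the usual Kafka producer semantics:

```yaml
output.kafka:
  # How many replica ACKs to wait for before an event counts as published:
  #   0  -> no ACK (events may be lost silently)
  #   1  -> leader only (default)
  #  -1  -> all in-sync replicas
  required_acks: 1
```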
- one of the brokers is unreachable (network) from the Filebeat location
This is a major problem with Kafka. Do not operate Kafka in a different data center from its clients. When using Kafka, outages are expected to be rare and short-lived. When a broker becomes unreachable, a number of partitions become unreachable with it.
With unreachable partitions, a client can follow one of two strategies:
- block until unreachable partitions are available again
- distribute events only among the reachable partitions
Each strategy comes with advantages and disadvantages. The disadvantage of strategy 1 is that long-running network failures will block clients. The disadvantage of strategy 2 is very uneven load balancing: at worst, all events might end up in one partition only. This must be taken into account when sizing partitions and deciding on retention periods. E.g. retention by size might lead to data loss at worst, while retention by time might lead to the disk running out of space -> another set of partitions becomes unavailable...
Using `output.kafka.partition.round_robin.reachable_only` you can choose strategy 1 by setting it to `false` (the default) and strategy 2 by setting it to `true`.
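For example, in filebeat.yml (only the relevant key shown):

```yaml
output.kafka:
  partition.round_robin:
    # false (default): round robin over all partitions; block and retry while
    #                  some of them are unreachable (strategy 1)
    # true:            publish only to currently reachable partitions, at the
    #                  cost of uneven load balancing (strategy 2)
    reachable_only: false
```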
Having network outages/hiccups should be the exception, not the norm. The Kafka cluster itself might still be correct. That is, clients have to keep trying to reconnect. The producer operates asynchronously to Beats and retries scheduled events. If it keeps failing, events might be returned for retry. But the broker is still the 'official' leader for the partition (the cluster metadata says the broker is available), so the client has to retry. Grep your logs for the regex `producer.*state change to.*retry`. In this state the partition is subject to event retries, yet the retries keep failing.

If your logs contain a string saying "there is currently no leader for this partition", then there is no leader for this partition in the cluster. In that case the client tries to reroute events to other partitions.
In Beats, the memory queue defines an upper bound on the number of events in flight. Outputs must ACK events to free up slots in the ring buffer. If one broker holds on to events for retrying, the queue runs full and blocks -> even the reachable brokers/partitions will not receive new events.
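The queue size itself is configurable via the Beats memory queue settings; a sketch with example values (not tuning recommendations):

```yaml
queue.mem:
  # Maximum number of events the in-memory ring buffer can hold. Slots are
  # only freed again once the output has ACKed the events.
  events: 4096
  # Flush to the output once this many events are buffered, or after the
  # timeout expires, whichever comes first.
  flush.min_events: 512
  flush.timeout: 1s
```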
- after being restarted Filebeat properly ships all delayed logs (despite one of the brokers still being inactive)
- after some time passes, the problem reoccurs and logs stop being shipped
Filebeat does not yet know about the network problem and schedules events among all partitions. Each partition/broker publishes events independently. The cluster state returned by Kafka tells the producer that the cluster is operating correctly, so the producer has to try. The partitions that are reachable can be served; the ones that are not reachable fail after some timeout. Beats stop sending events once the output buffer is full and one partition is detected to be unreachable.
So to summarize - looks like some sort of memory leak to me? The process seems to have some internal problem that slowly kills it.
Given your report I don't see any evidence for a memory leak. The process is not slowly being killed; it has to retry. Even with `reachable_only: true`, a broker/partition might be subject to retries as long as the cluster metadata still reports it as the leader. For retrying, events must be kept and re-sent, so overall progress can slow down considerably.