Filebeat errors and 100% CPU with stand-alone Kafka

When Filebeat is configured to output to Kafka, and Kafka + ZooKeeper are running stand-alone (single server), Filebeat starts spewing errors because it cannot work out which Kafka broker is the leader: the topic has no leader, because ZooKeeper is running stand-alone. We get errors like:

2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)
2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)

... repeating every 250ms
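For what it's worth, the same metadata loop can be reproduced outside Filebeat with the sarama Go client that libbeat's Kafka output is built on. This is only a minimal sketch, reusing the broker and topic from the logs above and the default settings mentioned further down; nothing here is the actual Filebeat code:

package main

import (
	"log"
	"os"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Route sarama's own logging to stderr so the client/metadata
	// messages from the logs above become visible.
	sarama.Logger = log.New(os.Stderr, "sarama ", log.LstdFlags)

	cfg := sarama.NewConfig()
	// Mirror the Filebeat Kafka output defaults discussed later in this thread.
	cfg.Metadata.Retry.Max = 3
	cfg.Metadata.Retry.Backoff = 250 * time.Millisecond
	cfg.Metadata.RefreshFrequency = 10 * time.Minute

	// "kq:9092" and "beats" are the broker and topic from the logs above.
	client, err := sarama.NewClient([]string{"kq:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Asking for the partition leader on a leaderless topic is what
	// triggers the client/metadata retry messages shown above.
	leader, err := client.Leader("beats", 0)
	if err != nil {
		log.Printf("no leader for topic beats: %v", err)
		return
	}
	log.Printf("leader is broker %d at %s", leader.ID(), leader.Addr())
}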

It also appears that these attempts may be leaving connections, or something, open, because if we then shut down the single Kafka node, Filebeat goes to 100% CPU and sits there, with more errors like:

2016-11-02T14:42:22+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:42:22+13:00 WARN Failed to connect to broker kq:9092: dial tcp 127.0.0.1:9092: getsockopt: connection refused
2016-11-02T14:42:22+13:00 WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *net.OpError=dial tcp 127.0.0.1:9092: getsockopt: connection refused)
2016-11-02T14:42:22+13:00 WARN kafka message: client/metadata no available broker to send metadata request to
2016-11-02T14:42:22+13:00 WARN client/brokers resurrecting 1 dead seed brokers
2016-11-02T14:42:22+13:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)

I think this is a bug, and it's probably around this bit of code:

Please provide comment.

What exactly is the bug? Have you tried changing the metadata retry interval?

Changing retry.backoff does affect the speed at which it continues to retry, but retry.max does not seem to be honoured. It is set to 3; the logs count the retries down 3, 2, 1, then immediately go back to 3 and repeat.

I think it repeats because it can't find a leader, because the topic replication factor is 1 when using a stand-alone Kafka + ZooKeeper. I guess it hasn't been tested with a stand-alone (single node) Kafka + ZooKeeper?

Should it be able to work when a topic looks like this:
./kafka-topics.sh --zookeeper localhost --describe --topic test
Topic:test PartitionCount:1 ReplicationFactor:1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr:

So, the bugs are:

  1. retry.max does not stop Filebeat from repeatedly trying to query the metadata for a topic when the leader value is 0 (see the sketch after this list).

  2. After it has been trying to collect metadata like above and looping, if Kafka is shut down, Filebeat goes to 100% CPU.
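To illustrate what I mean by bug 1: this is roughly the behaviour I would expect retry.max to give, sketched against the sarama client. The helper name and signature are made up; it is not the actual libbeat code:

package example

import (
	"time"

	"github.com/Shopify/sarama"
)

// fetchMetadataWithRetry is a hypothetical helper: refresh metadata for a
// topic at most maxRetries+1 times, sleeping backoff between attempts, and
// return the last error instead of restarting the countdown forever.
func fetchMetadataWithRetry(client sarama.Client, topic string, maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = client.RefreshMetadata(topic); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return err // give up; the caller decides what happens next
}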

Do you have some logs showing retry.max going down and being reset to 3? I wonder if this is due to the Kafka client library used by libbeat, or to Beats itself.

Even if the topic is not available, it might be created at some point in time. That is, Beats has to retry even if the topic does not exist yet.

Which backoff value did you set? This goes hand in hand with my first question about logs. Maybe we can introduce a short retry backoff for up to max retries, and if those fail, fall back to a very long backoff.
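Roughly what I have in mind, just as a sketch (invented names and values, not existing Beats code):

package example

import "time"

// backoffFor returns a short, fixed backoff for the first maxRetries
// attempts and falls back to a much longer backoff once those are used up,
// instead of spinning at the short interval forever.
func backoffFor(attempt, maxRetries int, short, long time.Duration) time.Duration {
	if attempt < maxRetries {
		return short
	}
	return long
}

// Example: backoffFor(0, 3, 250*time.Millisecond, 10*time.Minute) gives 250ms,
// while backoffFor(3, 3, 250*time.Millisecond, 10*time.Minute) gives 10m.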

Does the case of Kafka shutting down and Filebeat going to 100% CPU still happen with a big backoff value? From past experience I remember Kafka itself + the Java-based consumer going crazy when ZooKeeper becomes unavailable.

Definitely something to investigate. Feel free to open a GitHub issue with all the information you have, so this can be tracked as a bug.

I had the defaults set for the Filebeat Kafka output, so:
metadata:
  retry.max: 3
  retry.backoff: 250ms
  refresh_frequency: 10m

The retry.max did count down, but then looped and started counting down again from 3. It did so at the retry.backoff interval. Increasing retry.backoff slowed the loop, but it still kept looping.

What I think it should do is realise that the leader value is going to stay at 0, because the replicas are 0, and either stop trying or retry at the refresh_frequency interval.
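Roughly what I am imagining, as a sketch against the sarama client rather than an actual patch (waitForLeader is a made-up helper):

package example

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// waitForLeader keeps checking for a partition leader, but sleeps for the
// (long) refresh interval between checks instead of hammering the broker
// every retry.backoff.
func waitForLeader(client sarama.Client, topic string, refresh time.Duration) *sarama.Broker {
	for {
		if err := client.RefreshMetadata(topic); err != nil {
			log.Printf("metadata refresh for %s failed: %v", topic, err)
		}
		leader, err := client.Leader(topic, 0)
		if err == nil {
			return leader
		}
		log.Printf("no leader for %s yet (%v); waiting %s", topic, err, refresh)
		time.Sleep(refresh)
	}
}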

Issue opened: https://github.com/elastic/beats/issues/2945
