Filebeat errors and 100% CPU with stand-alone Kafka

When Filebeat is configured to output to Kafka, and Kafka + ZooKeeper are running stand-alone (single server), Filebeat starts spewing errors because it cannot work out which Kafka broker is the leader: the topic has no leader, because ZooKeeper is running stand-alone. We get errors like:

2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata retrying after 250ms... (3 attempts remaining)
2016-11-02T14:41:20+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:41:20+13:00 WARN kafka message: client/metadata found some partitions to be leaderless
2016-11-02T14:41:20+13:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)

... repeating every 250ms
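For what it's worth, the same metadata loop can be reproduced outside Filebeat with the sarama Go client that libbeat's Kafka output is built on. This is only a minimal sketch, reusing the broker and topic from the logs above and the default settings mentioned further down; nothing here is the actual Filebeat code:

package main

import (
	"log"
	"os"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Route sarama's own logging to stderr so the client/metadata
	// messages from the logs above become visible.
	sarama.Logger = log.New(os.Stderr, "sarama ", log.LstdFlags)

	cfg := sarama.NewConfig()
	// Mirror the Filebeat Kafka output defaults discussed later in this thread.
	cfg.Metadata.Retry.Max = 3
	cfg.Metadata.Retry.Backoff = 250 * time.Millisecond
	cfg.Metadata.RefreshFrequency = 10 * time.Minute

	// "kq:9092" and "beats" are the broker and topic from the logs above.
	client, err := sarama.NewClient([]string{"kq:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Asking for the partition leader on a leaderless topic is what
	// triggers the client/metadata retry messages shown above.
	leader, err := client.Leader("beats", 0)
	if err != nil {
		log.Printf("no leader for topic beats: %v", err)
		return
	}
	log.Printf("leader is broker %d at %s", leader.ID(), leader.Addr())
}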

It also appears that these attempts may be leaving connections, or something, open, because if we then shut down the single Kafka node, Filebeat goes to 100% CPU and sits there, with more errors like:

2016-11-02T14:42:22+13:00 WARN client/metadata fetching metadata for [beats] from broker kq:9092
2016-11-02T14:42:22+13:00 WARN Failed to connect to broker kq:9092: dial tcp 127.0.0.1:9092: getsockopt: connection refused
2016-11-02T14:42:22+13:00 WARN kafka message: client/metadata got error from broker while fetching metadata:%!(EXTRA *net.OpError=dial tcp 127.0.0.1:9092: getsockopt: connection refused)
2016-11-02T14:42:22+13:00 WARN kafka message: client/metadata no available broker to send metadata request to
2016-11-02T14:42:22+13:00 WARN client/brokers resurrecting 1 dead seed brokers
2016-11-02T14:42:22+13:00 WARN client/metadata retrying after 250ms... (2 attempts remaining)

I think this is a bug, and it's probably around this bit of code:

Please provide comment.

What exactly is the bug? Have you tried changing the metadata retry interval?

Changing retry.backoff does affect the speed at which it continues to retry, but retry.max does not seem to be honoured. It is set to 3; the logs count the retries down 3, 2, 1, then immediately go back to 3 and repeat.

I think it repeats because it can't find a leader, because the topic replication factor is 1 when using a stand-alone Kafka + ZooKeeper. I guess it hasn't been tested with a stand-alone (single node) Kafka + ZooKeeper?

Should it be able to work when a topic looks like this:
./kafka-topics.sh --zookeeper localhost --describe --topic test
Topic:test PartitionCount:1 ReplicationFactor:1 Configs:
Topic: test Partition: 0 Leader: 0 Replicas: 0 Isr:

So, the bugs are:

  1. retry.max does not stop Filebeat from repeatedly trying to query the metadata for a topic when the leader value is 0 (see the sketch after this list).

  2. After it has been trying to collect metadata like above and looping, if Kafka is shut down, Filebeat goes to 100% CPU.
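To illustrate what I mean by bug 1: this is roughly the behaviour I would expect retry.max to give, sketched against the sarama client. The helper name and signature are made up; it is not the actual libbeat code:

package example

import (
	"time"

	"github.com/Shopify/sarama"
)

// fetchMetadataWithRetry is a hypothetical helper: refresh metadata for a
// topic at most maxRetries+1 times, sleeping backoff between attempts, and
// return the last error instead of restarting the countdown forever.
func fetchMetadataWithRetry(client sarama.Client, topic string, maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = client.RefreshMetadata(topic); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return err // give up; the caller decides what happens next
}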

Do you have some logs showing retry.max going down and being reset to 3? I wonder if this is due to the Kafka client library used by libbeat, or to Beats itself.

Even if the topic is not available, it might be created at some point in time. That is, Beats has to retry even if the topic does not exist yet.

Which backoff value did you set? This goes hand in hand with my first question about logs. Maybe we can introduce a short retry backoff for up to max retries, and if those fail, fall back to a very long backoff.
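Roughly what I have in mind, just as a sketch (invented names and values, not existing Beats code):

package example

import "time"

// backoffFor returns a short, fixed backoff for the first maxRetries
// attempts and falls back to a much longer backoff once those are used up,
// instead of spinning at the short interval forever.
func backoffFor(attempt, maxRetries int, short, long time.Duration) time.Duration {
	if attempt < maxRetries {
		return short
	}
	return long
}

// Example: backoffFor(0, 3, 250*time.Millisecond, 10*time.Minute) gives 250ms,
// while backoffFor(3, 3, 250*time.Millisecond, 10*time.Minute) gives 10m.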

Does the case of Kafka shutting down and Filebeat going to 100% CPU still happen with a big backoff value? From past experience I remember Kafka itself + the Java-based consumer going crazy when ZooKeeper becomes unavailable.

Definitely something to investigate. Feel free to open a GitHub issue with all the information you have, so this can be tracked as a bug.

I had the defaults set for the Filebeat Kafka output, so:
metadata:
  retry.max: 3
  retry.backoff: 250ms
  refresh_frequency: 10m

The retry.max did count down, but then looped and started counting down again from 3. It did so at the retry.backoff interval. Increasing retry.backoff slowed the loop, but it still kept looping.

What I think it should do is realise that the leader value is going to stay at 0, because the replicas are 0, and either stop trying or retry at the refresh_frequency interval.
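Roughly what I am imagining, as a sketch against the sarama client rather than an actual patch (waitForLeader is a made-up helper):

package example

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// waitForLeader keeps checking for a partition leader, but sleeps for the
// (long) refresh interval between checks instead of hammering the broker
// every retry.backoff.
func waitForLeader(client sarama.Client, topic string, refresh time.Duration) *sarama.Broker {
	for {
		if err := client.RefreshMetadata(topic); err != nil {
			log.Printf("metadata refresh for %s failed: %v", topic, err)
		}
		leader, err := client.Leader(topic, 0)
		if err == nil {
			return leader
		}
		log.Printf("no leader for %s yet (%v); waiting %s", topic, err, refresh)
		time.Sleep(refresh)
	}
}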

Issue opened: https://github.com/elastic/beats/issues/2945
