Filebeat starts endless reconnections when a connection error happens to a Kafka cluster

DeeeFOX · July 24, 2016, 2:41pm

Version: latest stable
Operating System: Debian 8
Steps to Reproduce:

Description

There are 4 Filebeat (fb) nodes and a 2-node Kafka (kf) cluster, and data is sent from Filebeat to Kafka correctly in this setup.

fb   fb   fb .. fb
|     |     |    |
   kf ------- kf

The issue occurs when the network between one fb node and one kf node is blocked (by setting up a firewall rule that restricts port access between the two nodes), while the network between the kf nodes is still fine, so the kf cluster remains healthy, like this:

fb   fb   fb .. fb                                  fb   fb   fb .. fb
|X   |     |    |          --------->                  |     |      |
  kf ------- kf                                        kf ------- kf

Eventually, no more data can be sent from fb to the kf cluster because of endless reconnect attempts on fb.
This reconnect problem can be resolved by killing the unreachable Kafka node so that it leaves the cluster. After that, data can be sent again.

fb   fb   fb .. fb                                  fb   fb   fb .. fb
   |     |      |          --------->                |    |     |    |
  kf ------- kf                                             kf

Assumption

Based on what happened above, here is my assumption:

When the network is cut like this, the strategy for choosing the next node does not handle the situation described above.

This is caused by the source code shown below:

  func (client *client) cachedLeader(topic string, partitionID int32) (*Broker, error) {
  	client.lock.RLock()
  	defer client.lock.RUnlock()
  	partitions := client.metadata[topic] // <-------- 1)
  	if partitions != nil {
  		metadata, ok := partitions[partitionID] // <-------- 2)
  		if ok {
  			if metadata.Err == ErrLeaderNotAvailable {
  				return nil, ErrLeaderNotAvailable
  			}
  			b := client.brokers[metadata.Leader] // <-------- 3)
  			if b == nil {
  				return nil, ErrLeaderNotAvailable
  			}
  			_ = b.Open(client.conf)
  			return b, nil
  		}
  	}
  	return nil, ErrUnknownTopicOrPartition
  }

Since the Kafka cluster is still healthy, the topic metadata, the partition leaders, and the broker list have not changed.
Following steps 1), 2), and 3), the client still chooses the same partition on the same node to send the data, so the endless loop follows.
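
To illustrate the loop described above, here is a minimal, self-contained sketch. It is not the client library's code: the `cluster` type, its fields, and the `send` and `retry` helpers are hypothetical stand-ins for the cached metadata and the retry path. It only shows that as long as the cached leader mapping stays the same, every retry lands on the same unreachable broker.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable stands in for the connection error caused by the firewall.
var errUnreachable = errors.New("broker unreachable")

// cluster is a hypothetical, stripped-down stand-in for the client's cached state.
type cluster struct {
	leaders     map[int32]int32 // partition ID -> leader broker ID (cached metadata)
	unreachable map[int32]bool  // broker IDs blocked from this host
}

// send looks up the cached leader for the partition and tries to deliver to it.
func (c *cluster) send(partition int32) (int32, error) {
	leader := c.leaders[partition]
	if c.unreachable[leader] {
		return leader, errUnreachable
	}
	return leader, nil
}

// retry re-sends to the same partition and records the broker chosen on each attempt.
// It stops early only when a send succeeds.
func retry(c *cluster, partition int32, attempts int) []int32 {
	chosen := make([]int32, 0, attempts)
	for i := 0; i < attempts; i++ {
		leader, err := c.send(partition)
		chosen = append(chosen, leader)
		if err == nil {
			break
		}
	}
	return chosen
}

func main() {
	c := &cluster{
		leaders:     map[int32]int32{0: 1, 1: 2},
		unreachable: map[int32]bool{1: true}, // broker 1 is behind the firewall
	}
	fmt.Println(retry(c, 0, 5)) // every attempt targets broker 1
	fmt.Println(retry(c, 1, 5)) // partition 1 is led by the reachable broker 2
}
```

Because the cluster itself is healthy, nothing in this picture ever updates `leaders`, so the retries for partition 0 never move away from broker 1.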

Solution

Hopefully it can be solved by adopting a more thorough fault-tolerance strategy that takes different actions for different kinds of network errors.
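
One possible direction, as a rough sketch only: when the leader of a partition cannot be reached from this host, pick another partition whose leader is reachable instead of retrying the same one. The function names, the `unreachable` set, and the round-robin counter below are all hypothetical and are not taken from Filebeat or the Kafka client library.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

var errNoReachablePartition = errors.New("no partition with a reachable leader")

// reachablePartitions returns, in ascending order, the partitions whose
// leader broker is not marked unreachable from this host.
func reachablePartitions(leaders map[int32]int32, unreachable map[int32]bool) []int32 {
	var out []int32
	for p, leader := range leaders {
		if !unreachable[leader] {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

// pickPartition round-robins over the reachable partitions only.
// n is a non-negative, caller-supplied counter (hypothetical).
func pickPartition(leaders map[int32]int32, unreachable map[int32]bool, n int) (int32, error) {
	candidates := reachablePartitions(leaders, unreachable)
	if len(candidates) == 0 {
		return 0, errNoReachablePartition
	}
	return candidates[n%len(candidates)], nil
}

func main() {
	leaders := map[int32]int32{0: 1, 1: 2, 2: 1, 3: 2} // partition -> leader broker
	unreachable := map[int32]bool{1: true}             // broker 1 is behind the firewall

	for n := 0; n < 4; n++ {
		p, err := pickPartition(leaders, unreachable, n)
		fmt.Println(p, err) // alternates between partitions 1 and 3
	}
}
```

The trade-off is that events from the affected fb node would no longer be spread evenly over all partitions, so this only fits setups where events do not have to land on a specific partition.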

steffens · July 25, 2016, 1:13pm

see my response here

DeeeFOX · July 26, 2016, 3:39am

Oh, thanks! And WangXiangUSTC is my workmate! What a small world!

system · August 14, 2016, 2:42pm

This topic was automatically closed after 21 days. New replies are no longer allowed.

