Filebeat starts endless reconnections when a connection error to the Kafka cluster occurs

  • Version: latest stable
  • Operating System: Debian 8
  • Steps to Reproduce:

Description

There are 4 Filebeat nodes and a 2-node Kafka cluster, and data is being sent from the beats to Kafka correctly in this structure.

fb   fb   fb .. fb
|     |     |    |
   kf ------- kf

This issue happens when the network between one fb node and one kf node is disabled (by setting up a firewall and restricting access to the port between the two nodes), while the network between the kf nodes is still fine, so the kf cluster remains in a healthy state, like this.

fb   fb   fb .. fb                                  fb   fb   fb .. fb
|X   |     |    |          --------->                  |     |      |
  kf ------- kf                                        kf ------- kf

Finally, no more data can be sent from that fb node to the kf cluster because of endless reconnect attempts on the fb side.
This reconnect problem can be worked around by killing the unreachable Kafka node and removing it from the cluster. After that, data can be sent again.

fb   fb   fb .. fb                                  fb   fb   fb .. fb
   |     |      |          --------->                |    |     |    |
  kf ------- kf                                             kf

Assumption

From what happened above, here is my assumption:

  • When the network is disabled as described above, the strategy for choosing the next node is not suitable for this situation.

  • This appears to be caused by the source code shown below

      func (client *client) cachedLeader(topic string, partitionID int32) (*Broker, error) {
          client.lock.RLock()
          defer client.lock.RUnlock()
          partitions := client.metadata[topic] // <-------- 1) look up the cached partition metadata for the topic
          if partitions != nil {
              metadata, ok := partitions[partitionID] // <-------- 2) look up the metadata for this partition
              if ok {
                  if metadata.Err == ErrLeaderNotAvailable {
                      return nil, ErrLeaderNotAvailable
                  }
                  b := client.brokers[metadata.Leader] // <-------- 3) pick the cached leader broker for this partition
                  if b == nil {
                      return nil, ErrLeaderNotAvailable
                  }
                  _ = b.Open(client.conf)
                  return b, nil
              }
          }
          return nil, ErrUnknownTopicOrPartition
      }
    
  • Since the Kafka cluster is still in a healthy state, the metadata of the topic, the leader of the replica, and the broker list have not changed!

  • So by steps 1), 2), 3), it will still choose the same partition on the same node to send the data to, and the endless loop then happens (see the small sketch after this list).
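
A tiny self-contained sketch of that point (not sarama code, all names made up): as long as the cached metadata does not change, steps 1), 2), 3) keep resolving to the same leader broker, even though this particular fb host cannot reach it.

    package main

    import "fmt"

    func main() {
    	// step 1): cached topic -> partition metadata (stands in for client.metadata)
    	metadata := map[string]map[int32]int32{
    		"filebeat": {0: 101}, // partition 0's leader is broker 101
    	}
    	// step 3): cached broker ID -> broker address (stands in for client.brokers)
    	brokers := map[int32]string{101: "kf-1:9092", 102: "kf-2:9092"}

    	for attempt := 1; attempt <= 3; attempt++ {
    		partitions := metadata["filebeat"] // step 1)
    		leaderID := partitions[0]          // step 2)
    		leaderAddr := brokers[leaderID]    // step 3)
    		fmt.Printf("attempt %d -> %s\n", attempt, leaderAddr)
    		// Always "kf-1:9092": nothing in the cache changes between retries,
    		// so the broker this fb node is firewalled from is chosen every time.
    	}
    }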

Solution

Hopefully, it can be solved by adopting a more thorough fault-tolerance strategy that takes different actions for different network errors (a rough sketch of the idea is below).
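
As an illustration of what "different actions for different network errors" could look like, here is a sketch based on the cachedLeader code above. This is only my assumption of how such a check could be wired in, not sarama's actual fix; ErrAlreadyConnected and Broker.Connected() do exist in sarama, but the cachedLeaderChecked name and the overall flow are made up.

    // Hypothetical variant of cachedLeader: if the cached leader cannot be
    // connected to from this host, report ErrLeaderNotAvailable so the caller
    // can refresh metadata or pick another broker instead of retrying the
    // same dead connection forever.
    func (client *client) cachedLeaderChecked(topic string, partitionID int32) (*Broker, error) {
    	client.lock.RLock()
    	defer client.lock.RUnlock()

    	partitions := client.metadata[topic]
    	if partitions == nil {
    		return nil, ErrUnknownTopicOrPartition
    	}
    	metadata, ok := partitions[partitionID]
    	if !ok {
    		return nil, ErrUnknownTopicOrPartition
    	}
    	if metadata.Err == ErrLeaderNotAvailable {
    		return nil, ErrLeaderNotAvailable
    	}

    	b := client.brokers[metadata.Leader]
    	if b == nil {
    		return nil, ErrLeaderNotAvailable
    	}
    	if err := b.Open(client.conf); err != nil && err != ErrAlreadyConnected {
    		return nil, ErrLeaderNotAvailable // connection-level failure
    	}
    	if connected, err := b.Connected(); err != nil || !connected {
    		return nil, ErrLeaderNotAvailable // still cannot reach the leader from this host
    	}
    	return b, nil
    }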

See my response here.

Oh, thanks! And WangXiangUSTC is my workmate! What a small world!
