- Version: latest stable
- Operating System: Debian 8
- Steps to Reproduce:
Description
There are 4 Filebeat (fb) nodes and a 2-node Kafka (kf) cluster. In this topology, data is sent from Filebeat to Kafka correctly:
fb fb fb .. fb
| | | |
kf ------- kf
The issue occurs when the network between one fb node and one kf node is disabled (by setting up a firewall rule that blocks the port between the two nodes), while the network between the kf nodes stays up, so the kf cluster itself remains healthy:
fb fb fb .. fb fb fb fb .. fb
|X | | | ---------> | | |
kf ------- kf kf ------- kf
As a result, no more data can be sent from fb to the kf cluster, because fb keeps attempting to reconnect endlessly.
The reconnect problem can be resolved by removing the unreachable Kafka node from the cluster (killing it). After that, data can be sent again:
fb fb fb .. fb fb fb fb .. fb
| | | ---------> | | | |
kf ------- kf kf
Assumption
Based on the behavior above, here is my assumption:
- When a network link is disabled, the strategy for choosing the next node does not handle the situation described above.
- This is caused by the source code shown below:
func (client *client) cachedLeader(topic string, partitionID int32) (*Broker, error) {
	client.lock.RLock()
	defer client.lock.RUnlock()

	partitions := client.metadata[topic] // <-------- 1)
	if partitions != nil {
		metadata, ok := partitions[partitionID] // <-------- 2)
		if ok {
			if metadata.Err == ErrLeaderNotAvailable {
				return nil, ErrLeaderNotAvailable
			}
			b := client.brokers[metadata.Leader] // <-------- 3)
			if b == nil {
				return nil, ErrLeaderNotAvailable
			}
			_ = b.Open(client.conf)
			return b, nil
		}
	}

	return nil, ErrUnknownTopicOrPartition
}
- Since the Kafka cluster is still healthy, the topic metadata, the partition leaders, and the node list have not changed.
- Following steps 1), 2), and 3), the client keeps choosing the same partition on the same node to send the data, so the endless loop occurs.
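To make the loop concrete, here is a minimal, self-contained sketch of the behavior I am assuming. It is a toy model, not the real client code: the `cluster` type, the `reachable` map, and the `send` helper are hypothetical names invented for illustration, and the retry count is bounded only so the program terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable stands in for the dial failure caused by the firewall rule.
var errUnreachable = errors.New("broker unreachable from this client")

// cluster is a toy model of the client's cached view (hypothetical type).
type cluster struct {
	leader    map[int32]int32 // partition ID -> leader broker ID (cached metadata)
	reachable map[int32]bool  // broker ID -> can this client connect to it?
}

// cachedLeader mirrors the lookup above: it consults only the cached
// metadata and ignores whether earlier connection attempts failed.
func (c *cluster) cachedLeader(partition int32) int32 {
	return c.leader[partition]
}

// send tries to deliver to the partition leader up to maxRetries times and
// returns the broker IDs it attempted, in order.
func (c *cluster) send(partition int32, maxRetries int) ([]int32, error) {
	var attempts []int32
	for i := 0; i < maxRetries; i++ {
		b := c.cachedLeader(partition)
		attempts = append(attempts, b)
		if c.reachable[b] {
			return attempts, nil
		}
		// Refreshing metadata here would not help in this scenario: the
		// cluster is healthy, so it would report the same leader again.
	}
	return attempts, errUnreachable
}

func main() {
	c := &cluster{
		leader:    map[int32]int32{0: 1, 1: 2},
		reachable: map[int32]bool{1: false, 2: true}, // broker 1 is firewalled
	}

	attempts, err := c.send(0, 3)
	fmt.Println(attempts, err) // prints "[1 1 1] broker unreachable from this client"

	attempts, err = c.send(1, 3)
	fmt.Println(attempts, err) // prints "[2] <nil>"
}
```

In this model every retry for partition 0 lands on broker 1 again. The sketch does not model how the stalled partition then blocks the rest of the output pipeline; that part is only what I observed.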
Solution
Hopefully, this can be solved by adopting a more thorough fault-tolerance strategy that takes different actions for different kinds of network errors.