Logstash has a Kafka input and outputs to Elasticsearch and syslog (I'm using the Logstash output isolator pattern). When one Kafka node is down (kafka3 in this case), Logstash stops consuming and pushing logs to Elasticsearch, with the following message:
Group coordinator kafka3:9092 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery
When I start the downed Kafka node again, there are no more error messages:
[2021-04-27T18:10:36,988][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][poc-filebeat2][8dfdf7cf24fb7b5cb09ad368706e1e176d6bf914663fc10e1a5df56283e6de0c] [Consumer clientId=logstash-0, groupId=logstash] Discovered group coordinator kafka3:9092 (id: 2147483644 rack: null)
[2021-04-27T18:10:36,987][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][poc-filebeat2][8dfdf7cf24fb7b5cb09ad368706e1e176d6bf914663fc10e1a5df56283e6de0c] [Consumer clientId=logstash-1, groupId=logstash] Member logstash-1-e671c77f-6c8c-4606-9038-feecbe6b3bf5 sending LeaveGroup request to coordinator kafka3:9092 (id: 2147483644 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
[2021-04-27T18:10:36,991][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator][poc-filebeat2][8dfdf7cf24fb7b5cb09ad368706e1e176d6bf914663fc10e1a5df56283e6de0c] [Consumer clientId=logstash-1, groupId=logstash] Giving away all assigned partitions as lost since generation has been reset, indicating that consumer is no longer part of the group
[2021-04-27T18:10:36,992][INFO ][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator][poc-filebeat2][8dfdf7cf24fb7b5cb09ad368706e1e176d6bf914663fc10e1a5df56283e6de0c] [Consumer clientId=logstash-1, groupId=logstash] Lost previously assigned partitions default.linux-2, default.linux-3
[2021-04-27T18:10:36,993][INFO ][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][poc-filebeat2][8dfdf7cf24fb7b5cb09ad368706e1e176d6bf914663fc10e1a5df56283e6de0c] [Consumer clientId=logstash-1, groupId=logstash] (Re-)joining group
But I didn't expect that one Kafka node being down would have such an impact...
Any idea on how to tell Logstash to continue even if one Kafka node is down?
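For reference, here is a minimal sketch of the kafka input side of my pipeline (the broker hostnames other than kafka3, the topic name and the thread count are inferred from the log lines above, so they may not match the real config exactly):

```
input {
  kafka {
    # Listing every broker lets the client bootstrap through whichever
    # nodes are still reachable when one of them is down.
    bootstrap_servers => "kafka1:9092,kafka2:9092,kafka3:9092"
    topics            => ["default.linux"]
    group_id          => "logstash"
    consumer_threads  => 2
  }
}
```

bootstrap_servers takes a comma-separated list, so reaching the remaining brokers should not be the problem; yet the error above is about the group coordinator.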
I don't think this is a LogStash problem. Can you please check which replication factor your topic uses?
I am guessing that your topic has a replication factor of 1, which means no replication to other nodes. In that case you cannot access the data, as it is only stored on kafka3.
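You can check this with the kafka-topics tool, something like the following (assuming one of the surviving brokers, e.g. kafka1, is reachable on port 9092 and that the topic name matches the partitions shown in your logs):

```
# Replication factor and replica placement of the data topic
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --topic default.linux

# The internal offsets topic matters too: the group coordinator is the broker
# leading the __consumer_offsets partition that holds this group's offsets, so
# if that partition only lives on kafka3 the coordinator cannot move elsewhere.
kafka-topics.sh --bootstrap-server kafka1:9092 --describe --topic __consumer_offsets
```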
I didn't change offsets.topic.replication.factor, so I must have the default setting, which according to the documentation is 3, but maybe I should verify it?
Yep, good idea concerning the consumer offsets topic. Here is the output:
That does not make sense to me... Once kafka3 is down, a replica is promoted as leader, taking over kafka3's leader role. As you can see, even with kafka3 missing, all the partitions have a leader. So as far as I can tell, all the data is there and Logstash should be able to consume.
One more question: with 3 Kafka brokers, if I lose 1 or 2 brokers, do you think Logstash will be impacted? I know Logstash will still be able to consume, but is there any side effect? (like more load)
As long as the topics are distributed over all Kafka nodes, LogStash should not be impacted, provided the remaining Kafka node is not under too much pressure. Of course, if the last Kafka node has lots of documents coming in and out, the performance of LogStash reading data from Kafka could be impacted too...