I have the same problem. After some time, messages are marked as un-acked and are resent constantly. It happens suddenly; at the same time Kafka operates without any problems, and other nodes have no trouble sending messages.
From the logs:
2016-10-07T07:06:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=2735 registrar.states.update=2735 registrar.writes=6 publish.events=2735 libbeat.kafka.call_count.PublishEvents=6 libbeat.kafka.published_and_acked_events=2735
2016-10-07T07:06:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_and_acked_events=1789 publish.events=1789 registrar.writes=6 libbeat.kafka.call_count.PublishEvents=6 libbeat.publisher.published_events=1789 registrar.states.update=1789
2016-10-07T07:07:05-07:00 INFO Non-zero metrics in the last 30s: publish.events=1622 libbeat.kafka.call_count.PublishEvents=48 libbeat.publisher.published_events=2617 libbeat.kafka.published_and_acked_events=1622 registrar.writes=5 registrar.states.update=1622 libbeat.kafka.published_but_not_acked_events=41790
2016-10-07T07:07:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1246 libbeat.kafka.published_but_not_acked_events=1239770
2016-10-07T07:08:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_but_not_acked_events=1312405 libbeat.kafka.call_count.PublishEvents=1319
2016-10-07T07:08:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1288 libbeat.kafka.published_but_not_acked_events=1281560
2016-10-07T07:09:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1254 libbeat.kafka.published_but_not_acked_events=1247730
2016-10-07T07:09:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_but_not_acked_events=1170120 libbeat.kafka.call_count.PublishEvents=1176
2016-10-07T07:10:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1268 libbeat.kafka.published_but_not_acked_events=1261660
2016-10-07T07:10:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1221 libbeat.kafka.published_but_not_acked_events=1214895
2016-10-07T07:11:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1287 libbeat.kafka.published_but_not_acked_events=1280565
2016-10-07T07:11:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1270 libbeat.kafka.published_but_not_acked_events=1263650
2016-10-07T07:11:35-07:00 WARN client/metadata fetching metadata for all topics from broker 192.168.121.25:9092
2016-10-07T07:12:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=679 libbeat.kafka.published_but_not_acked_events=675605
These are typical Tomcat logs, but they contain a lot of debug data such as HTTP headers, user data submitted in forms, etc. It's hard to find exactly which line causes this problem because we have hundreds of lines per second.
With big multiline events, also take the maximum allowed message size into account (see the Kafka settings); this limit is enforced/configured in Kafka itself. The message size in bytes is the total size of the JSON-encoded event.
The batch size in bytes sent to Kafka depends on the number of events to be published and the sizes of the events themselves. The bigger the batch, the longer it takes to encode, transmit, decode, and store. One can try to limit the amount of data sent per batch by limiting multiline sizes and by reducing bulk_max_size or spool_size.
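For reference, a minimal filebeat.yml sketch (filebeat 5.x syntax; the path, multiline pattern, topic name, and all numbers are placeholders picked for illustration, not recommended values) showing where these limits live:

filebeat.prospectors:
- input_type: log
  paths:
    - /var/log/tomcat/catalina.out     # placeholder path
  multiline.pattern: '^[[:space:]]'    # placeholder pattern for continuation lines
  multiline.negate: false
  multiline.match: after
  multiline.max_lines: 500             # cap how many lines get merged into one event
  max_bytes: 1048576                   # cap the byte size of a single event

filebeat.spool_size: 1024              # smaller spooler -> smaller batches handed to the output

output.kafka:
  hosts: ["192.168.121.25:9092"]
  topic: "tomcat-logs"                 # placeholder topic
  bulk_max_size: 512                   # fewer events per PublishEvents call
  max_message_bytes: 1000000           # should not exceed Kafka's message.max.bytes

The point is that the per-event caps (multiline.max_lines, max_bytes) and the batching knobs (spool_size, bulk_max_size, max_message_bytes) have to stay consistent with what the Kafka brokers accept.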
2016-10-07T07:11:35-07:00 WARN client/metadata fetching metadata for all topics from broker 192.168.121.25:9092
@steffens About the max size: what is the message returned by Kafka if that happens? If it is an error / not acked, this could explain the repeated events, as filebeat constantly resends the event while Kafka potentially stores a "subset" of the event but not the full event.
@psychonaut @Tim_Burt Could you check whether the repeated event is one of the large events, or at least in the same batch?
The message about fetching metadata is typical for Kafka producers: they need to fetch the topology of the topics to know where they should write messages. I will try to isolate the batch that causes the error.
For the above log message: I just realised it's coming from Sarama (the Kafka library) rather than libbeat itself, so it is not really under our full control. I was more thinking it should probably be an INFO rather than a WARN message.
Digging into the code, this can indeed be a problem. The docs say the message is dropped, but the code returns the error sarama.ErrMessageSizeTooLarge. While sarama would only drop this one message, libbeat aggregates all errors for a batch, potentially forcing the kafka output to resend the complete batch.
@psychonaut maybe you can create a GitHub issue about batches being resent if some event in the batch is too big. I'd really like to mark the issue for the Pioneer Program.
I will prepare a fix on master for just dropping too-large messages with an error message (the output cannot tell which fields need to be shortened). The proper fix is to adapt the sizes in Kafka itself plus in the filebeat.yml kafka output section and/or the multiline settings. This is unfortunately fully up to the user: as these big events are normally traces, no one wants to drop them, but the Kafka settings can force producers to do so.
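On the Kafka side, a sketch of the broker settings involved (server.properties; the numbers are illustrative only, not recommendations):

# server.properties
message.max.bytes=2000000          # largest message the broker will accept
replica.fetch.max.bytes=2000000    # keep >= message.max.bytes so followers can replicate large messages

Instead of raising it broker-wide, the limit can also be raised per topic via kafka-configs.sh with --alter --entity-type topics --add-config max.message.bytes=... . Consumers then need a fetch size at least as large (fetch.message.max.bytes for the old consumer, max.partition.fetch.bytes for the new one).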
I've been thinking about adding a dead-letter queue for a while (e.g. for this use case), but it definitely won't be in the 5.0 release.