Filebeat 5.0beta1 with kafka output: multiplication of log lines


(Krzysztof Krzyzaniak) #1

Followup-to Filebeat 5.0.0-alpha5 multiple harvesters for same file sending the same messages over and over

I have the same problems. After some time messages are marked as un-acked and repeated constantly. It's sudden, in the same time kafka operates without any problems, other nodes don't have problems with sending messages:

From logs:

2016-10-07T07:06:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.publisher.published_events=2735 registrar.states.update=2735 registrar.writes=6 publish.events=2735 libbeat.kafka.call_c
ount.PublishEvents=6 libbeat.kafka.published_and_acked_events=2735
2016-10-07T07:06:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_and_acked_events=1789 publish.events=1789 registrar.writes=6 libbeat.kafka.call_count.PublishEvents=6 li
bbeat.publisher.published_events=1789 registrar.states.update=1789
2016-10-07T07:07:05-07:00 INFO Non-zero metrics in the last 30s: publish.events=1622 libbeat.kafka.call_count.PublishEvents=48 libbeat.publisher.published_events=2617 libbeat.kafka.published_an
d_acked_events=1622 registrar.writes=5 registrar.states.update=1622 libbeat.kafka.published_but_not_acked_events=41790
2016-10-07T07:07:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1246 libbeat.kafka.published_but_not_acked_events=1239770
2016-10-07T07:08:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_but_not_acked_events=1312405 libbeat.kafka.call_count.PublishEvents=1319
2016-10-07T07:08:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1288 libbeat.kafka.published_but_not_acked_events=1281560
2016-10-07T07:09:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1254 libbeat.kafka.published_but_not_acked_events=1247730
2016-10-07T07:09:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.published_but_not_acked_events=1170120 libbeat.kafka.call_count.PublishEvents=1176
2016-10-07T07:10:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1268 libbeat.kafka.published_but_not_acked_events=1261660
2016-10-07T07:10:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1221 libbeat.kafka.published_but_not_acked_events=1214895
2016-10-07T07:11:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1287 libbeat.kafka.published_but_not_acked_events=1280565
2016-10-07T07:11:35-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=1270 libbeat.kafka.published_but_not_acked_events=1263650
2016-10-07T07:11:35-07:00 WARN client/metadata fetching metadata for all topics from broker 192.168.121.25:9092

2016-10-07T07:12:05-07:00 INFO Non-zero metrics in the last 30s: libbeat.kafka.call_count.PublishEvents=679 libbeat.kafka.published_but_not_acked_events=675605

Filebeat configuration:

/etc/filebeat/filebeat.yml

--- 
  filebeat: 
    spool_size: 2048
    idle_timeout: "5s"
    registry_file: registry
    publish_async: false
    config_dir: /etc/filebeat/conf.d
  output: 
    kafka: 
      max_retries: 100
      channel_buffer_size: 512
      keep_alive: 0
      worker: 1
      required_acks: -1
      broker_timeout: "30s"
      timeout: "60s"
      hosts: 
        - "am2-kafka01:9092"
        - "am2-kafka02:9092"
        - "am2-kafka03:9092"
        - "am2-kafka04:9092"
        - "am2-kafka05:9092"
      topic: am2-tomcat
  logging: 
    level: info
    to_files: true
    to_syslog: false
    files: 
      path: /var/log/filebeat/
      name: filebeat.log
      rotateeverybytes: 10485760
      keepfiles: 7

/etc/filebeat/conf.d/prospector.yml
filebeat:
prospectors:
- paths:
- /opt/tomcat/current/logs/tomcat.log
encoding: plain
fields:
datacenter: am2
type: tomcat
fields_under_root: false
input_type: log
document_type: log
scan_frequency: 10s
harvester_buffer_size: 16384
tail_files: false
force_close_files: false
backoff: 1s
max_backoff: 10s
backoff_factor: 2
partial_line_waiting: 5s
max_bytes: 10485760
multiline:
pattern: '^(ERROR|INFO|WAR|TRACE|NOTICE|ALERT|CRIT|DEBUG)'
negate: true
match: after

Versions: centos 7.2, filebeat 5.0beta1, kafka 0.10.1. I've tested daily snapshot 6.0-alpha and looks like it doesn't happen with that version.


Filebeat 5.0.0-alpha5 multiple harvesters for same file sending the same messages over and over
(Tim Burt) #2

Hello, I have been working with ruflin to try to narrow this down.

Are you running multiline on the logs?

In my case, it appears that my logs contain as many as 10,000 lines that multiline attempts to join.

This may produce a document that kafka takes some time to receive..

Can you elaborate more on any conditions that you may have that are similar...

Thanks!


(Krzysztof Krzyzaniak) #3

Yes, it's multiline prospector for tomcat logs.

Similarity:

  • multiline
  • peak in kibana (suddenly millions repeated rows instead of thousands)
  • libbeat.kafka.published_but_not_acked_events appeared in filebeat log.


(Tim Burt) #4

I believe Ruflin is in EU and may be done for the week. We will probably
not hear from him until Monday. Hang on till then.

How do the original logs look? My devs are outputing non-logs to the log
files and this I believe is driving mutilline nuts.

In one case, I have 10,000 non-logs surrounding the repeated lines.

It would be helpful if you can confirm the status of the original logs.


(Krzysztof Krzyzaniak) #5

these are typical tomcat logs but there are lot of debug data inside like http headers, user data provided in forms etc. It's hard to find which line exactly cause this problems becaue we have hundreds lines per second.


(Tim Burt) #6

Thanks for the info. Let's see what Ruflin says.


(Steffen Siering) #7

Having big multiline events also consider max allowed message size into account. See kafka settings. This limit is enforced/configured in kafka itself. The message size in bytes is total size of json encoded event.

The batch sizes in bytes send to kafka will depend on number of events to be published and the event sizes itself. The bigger the batch, the longer it will take to encode, transmit, decode and store. One can try to limit amount of data send per batch by limiting multiline sizes and reducing bulk_max_size or spooler_size.


(Krzysztof Krzyzaniak) #8

That actually could be the case. I have set in filebeat prospector max_bytes: 10485760 and in kafka settings message.max.bytes=1000000

I will increase message.max.bytes to size of max_bytes and will check if that helped.

Anyway, should filebeat throw exception some kind when something like this happen?


(ruflin) #9

One thing from the logs which made me curious

2016-10-07T07:11:35-07:00 WARN client/metadata fetching metadata for all topics from broker 192.168.121.25:9092

@steffens About the max size: What is the message returned by Kafka if that happens. In case it is an error / no acked this could explain the repeated events as filebeat constantly resends the event and kafka potential stores a "subset" of the event but not the full event.

@psychonaut @Tim_Burt Could you check if the repeated event is one of the large events or at least in the same batch?


(Krzysztof Krzyzaniak) #10

Message about fetching metadata it typical for kafka producers. They need to fetch topography of topics to know where they should write messages. I will try to separate batch which causes error.


(ruflin) #11

@psychonaut Thanks for doing the checks

For the above log message: I just realised it's coming from the Sarama (Kafka Library) and libbeat itself, so it is not really under our full control. I was more thinking it should probably not be a WARN but an INFO message.


(Steffen Siering) #12

Digging into the code, this can indeed be a problem. Doc says message is dropped, but code returns an error sarama.ErrMessageSizeTooLarge. While sarama would only drop this message, libbeat aggregates all errors for a batch, potentially forcing kafka to resend the complete batch.

@psychonaut maybe you can create an github issue about batches being resend if some event in the batch is too big. I'd really like to mark the issue for the Pioneer Program.

I will prepare a fix on master for just dropping too large messages with an error messsage (output can not tell which fields need to be shortened). Fix should be to adapt sizes in kafka itself + in filebeat.yml kafka output section and/or multiline. This is unfortunately fully up to the user. As these big events are normally traces, no one wants to drop them, but kafka settings can force producers to do so.
I'm thinking about adding a dead-letter queue for a while (e.g. for this use-case), but definitely not in 5.0 release.


(Krzysztof Krzyzaniak) #13

https://github.com/elastic/beats/issues/2735 - if you need I can expand it

I think clear documentation how to adjust values for kafka output would be completely fine for 5.0.


(ruflin) #14

@Tim_Burt @psychonaut Thanks for pushing the investigation on this forward and providing all the data.


(system) #15

This topic was automatically closed after 21 days. New replies are no longer allowed.