Lag in Filebeat tailing logs from local disk and sending them to Kafka

The Filebeat timestamp in the log message is 2019-05-24T17:16:10.063Z, while the original log timestamp is I0524 16:32:40.761113 (a gap of roughly 43 minutes).
So it takes Filebeat more than 30 minutes to tail our production C++ logs and send them to Kafka.
I looked at CPU consumption: it is above 90%, but not pegged at 99%, and from the Filebeat logs, filebeat.events.done keeps up with filebeat.events.added, so that does not explain why there is such a huge delay in shipping the log lines to Kafka.

Here are some CPU/IO stats from that time interval, and I have also copied the Filebeat logs at the end. Please let me know if anything looks suspicious. TIA for any pointers.

2019/05/24 16:32:02.643 16:31:42 UID PID %usr %system %guest %CPU CPU Command
2019/05/24 16:32:02.643 16:32:02 2147483006 6303 94.35 0.65 0.00 95.00 50 bin/filebeat -path.config /export/content/lid/apps/i

2019/05/24 16:32:02.645 16:31:42 UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
2019/05/24 16:32:02.645 16:32:02 2147483006 6303 0.00 39.60 0.00 bin/filebeat -path.config /export/content/lid/apps/i

2019/05/24 16:33:02.641 16:32:42 UID PID %usr %system %guest %CPU CPU Command
2019/05/24 16:33:02.641 16:33:02 2147483006 6303 90.15 0.70 0.00 90.85 50 bin/filebeat -path.config /export/content/lid/apps/i

2019/05/24 16:33:02.642 16:32:42 UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
2019/05/24 16:33:02.642 16:33:02 2147483006 6303 0.00 36.00 0.00 bin/filebeat -path.config /export/content/lid/apps/i

2019/05/24 16:34:02.641 16:33:42 UID PID %usr %system %guest %CPU CPU Command
2019/05/24 16:34:02.641 16:34:02 2147483006 6303 94.35 0.65 0.00 95.00 50 bin/filebeat -path.config /export/content/lid/apps/i

2019/05/24 16:34:02.643 16:33:42 UID PID kB_rd/s kB_wr/s kB_ccwr/s Command
2019/05/24 16:34:02.643 16:34:02 2147483006 6303 0.00 37.20 0.00 bin/filebeat -path.config /export/content/lid/apps/i

The following are the Filebeat logs from that time interval:

2019-05-24T16:31:35Z INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=29842 beat.memstats.gc_next=101127008 beat.memstats.memory_alloc=62520888 beat.memstats.memory_total=29277883633496 filebeat.events.active=-2049 filebeat.events.added=200703 filebeat.events.done=202752 filebeat.harvester.open_files=46 filebeat.harvester.running=46 libbeat.config.module.running=0 libbeat.output.events.acked=200704 libbeat.output.events.batches=98 libbeat.output.events.total=200704 libbeat.outputs.kafka.bytes_read=150431 libbeat.outputs.kafka.bytes_write=16495561 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=4117 libbeat.pipeline.events.published=200704 libbeat.pipeline.events.total=200704 libbeat.pipeline.queue.acked=200704 registrar.states.current=50 registrar.states.update=201042 registrar.writes=92

2019-05-24T16:32:05Z INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30076 beat.memstats.gc_next=97412208 beat.memstats.memory_alloc=71622560 beat.memstats.memory_total=29284614364288 filebeat.events.active=1 filebeat.events.added=200705 filebeat.events.done=200704 filebeat.harvester.open_files=46 filebeat.harvester.running=46 libbeat.config.module.running=0 libbeat.output.events.acked=200704 libbeat.output.events.batches=98 libbeat.output.events.total=200704 libbeat.outputs.kafka.bytes_read=155717 libbeat.outputs.kafka.bytes_write=16677835 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=4117 libbeat.pipeline.events.published=200704 libbeat.pipeline.events.total=200704 libbeat.pipeline.queue.acked=200704 registrar.states.current=50 registrar.states.update=200704 registrar.writes=97

2019-05-24T16:32:35Z INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30041 beat.memstats.gc_next=106706544 beat.memstats.memory_alloc=83175416 beat.memstats.memory_total=29291353320392 filebeat.events.active=-1 filebeat.events.added=196607 filebeat.events.done=196608 filebeat.harvester.open_files=46 filebeat.harvester.running=46 libbeat.config.module.running=0 libbeat.output.events.acked=196608 libbeat.output.events.batches=96 libbeat.output.events.total=196608 libbeat.outputs.kafka.bytes_read=158270 libbeat.outputs.kafka.bytes_write=16390478 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=4117 libbeat.pipeline.events.published=196608 libbeat.pipeline.events.total=196608 libbeat.pipeline.queue.acked=196608 registrar.states.current=50 registrar.states.update=196608 registrar.writes=95

2019-05-24T16:33:05Z INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=29873 beat.memstats.gc_next=98937296 beat.memstats.memory_alloc=80192600 beat.memstats.memory_total=29297836014280 filebeat.events.active=-2009 filebeat.events.added=194599 filebeat.events.done=196608 filebeat.harvester.open_files=46 filebeat.harvester.running=46 libbeat.config.module.running=0 libbeat.output.events.acked=196608 libbeat.output.events.active=-2048 libbeat.output.events.batches=95 libbeat.output.events.total=194560 libbeat.outputs.kafka.bytes_read=149414 libbeat.outputs.kafka.bytes_write=16184406 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=2111 libbeat.pipeline.events.published=194603 libbeat.pipeline.events.total=194602 libbeat.pipeline.queue.acked=196608 registrar.states.current=50 registrar.states.update=196608 registrar.writes=92

2019-05-24T16:33:35Z INFO Non-zero metrics in the last 30s: beat.info.uptime.ms=30029 beat.memstats.gc_next=105970720 beat.memstats.memory_alloc=89646640 beat.memstats.memory_total=29304542031160 filebeat.events.active=2582 filebeat.events.added=201238 filebeat.events.done=198656 filebeat.harvester.open_files=46 filebeat.harvester.running=46 libbeat.config.module.running=0 libbeat.output.events.acked=200704 libbeat.output.events.batches=98 libbeat.output.events.total=200704 libbeat.outputs.kafka.bytes_read=155346 libbeat.outputs.kafka.bytes_write=16654787 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=2642 libbeat.pipeline.events.published=201234 libbeat.pipeline.events.total=201235 libbeat.pipeline.queue.acked=200704 registrar.states.current=50 registrar.states.update=200704 registrar.writes=91

Lag in tailing logs can be a sign of misconfiguration. Could you share your Filebeat config (using </> to preserve formatting)?


I copied my Filebeat config below. @faec and others, does anything look suspicious to you? Thanks!
```yaml
filebeat.max_procs: 3
filebeat.prospectors:
  - input_type: log
    paths:
      - ${filebeat.log.paths}
    close_removed: true
    ignore_older: 60m
    close_inactive: 5m
    close_timeout: 70m
    clean_inactive: 48h
    clean_removed: true
    symlinks: true
    multiline.pattern: '^[[:digit:]]{4,}[[:punct:]][[:digit:]]{2,}|^[I|W|E|F|D|T][[:digit:]]{4,}'
    multiline.negate: true
    multiline.match: after

processors:
  - drop_fields:
      fields: ["beat.name", "beat.version", "input_type", "@timestamp", "@metadata", "metadata.type", "type", "offset"]

output.kafka:
  hosts: ["${kafka.logging.metadata.tls.broker.list}"]
  topic: '${filebeat.log.kafka.topic}'
  partition.hash:
    hash: ['beat.hostname', 'source']
    random: true # if false non-hashable events will be dropped
  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
  ssl.certificate_authorities: ["{kafka.ssl.ca}"]
  ssl.certificate: "{kafka.ssl.certificate}"
  ssl.key: "${kafka.ssl.key}"
```

I think you are constrained by the number of Kafka output workers. Your Filebeat instance doesn't appear to be able to use more than 100% CPU, even though you allowed it to use 3 cores.

It looks like ~200K events per 30s (your metrics show about 98 batches of 2048 events, roughly 6,700 events/s) is the most your current config can ship out on this hardware with a single Kafka worker. If your Kafka brokers can actually take more with this partitioning config, and everything else stays the same, then you should see an immediate improvement simply by running more Filebeat Kafka workers.

Try raising this and report back what changes.
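A rough sketch of what that could look like in the kafka output section you already have (the value here is just a starting point to experiment with, not a recommendation):

```yaml
output.kafka:
  hosts: ["${kafka.logging.metadata.tls.broker.list}"]
  topic: '${filebeat.log.kafka.topic}'
  # number of concurrent kafka output workers; you are currently running
  # with the default single worker
  worker: 2
```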

That's an educated guess; the info you posted points in that direction but doesn't include enough to form a better guess so far.
You could also try a Filebeat file output instead of Kafka. That would show whether Filebeat is constrained on the harvesting side... but I doubt that very much, mainly because I'd expect Filebeat to use more than 100% CPU if it had trouble keeping up with harvesting.
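If you want to try that isolation test, a minimal sketch (the path is only an example):

```yaml
# temporarily swap output.kafka for a local file output; if filebeat can
# write events to disk much faster than ~200K per 30s, the bottleneck is
# on the kafka output side rather than in harvesting
output.file:
  path: "/tmp/filebeat-test"   # example path, pick a disk with room
  filename: filebeat-events
```

Remember to comment out the kafka output while testing, since recent Filebeat versions only allow one enabled output at a time.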


@martinr_ubi thanks for the pointer!
Just to make sure I understand this correctly, I need to add
worker: 10
under the output.kafka section.
How many Kafka output workers should I use? We have 100+ Kafka partitions. Is 10 going overboard; should I set it to 5 instead? I will try this tomorrow and report back here.

thanks!
yan

Yes, I'm talking about the worker count in your Kafka output, the same as the worker count in my Elasticsearch output; it works hand in hand with batch size and max procs.
As for how many... it depends :slight_smile:

If it works, and to learn as you go, I would go up to 3 workers and check whether Filebeat saturates 3 cores when using 3 workers. If that still isn't enough to keep up but it looks like the right knob, I would go higher and check the behavior and the metrics each time. If at 3 workers you do saturate 3 cores, I would add another worker and check whether max procs prevents you from using more CPU even as you add workers, which is what I would expect. Then, in theory, to go faster you would need to raise worker and max procs together to unlock more CPU for the workers you're adding.
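A sketch of that pairing, to be merged into your existing config (the numbers are only an illustration to iterate on, not tuned values):

```yaml
# cap filebeat at 3 cores and give the kafka output one worker per core;
# if 3 cores saturate and you still lag, raise both together (e.g. 4 and 4)
# and re-check the CPU stats and the 30s metrics after each change
filebeat.max_procs: 3

output.kafka:
  worker: 3
  # bulk_max_size is the batch-size knob on this output, if you also want
  # to experiment with batch size (your metrics show batches of 2048 events)
  # bulk_max_size: 2048
```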

The goal is to reach a point where you can keep up. If your Kafka setup is not a bottleneck, you should be able to reach a very high rate of events/sec, up to whatever is enough to keep up with the logging rate of the app whose logs you're harvesting.

In theory this is governed by max procs, the worker count, the batch size, and the capacity of your Kafka setup and everything downstream from Filebeat in your pipeline.

You have to keep in mind what the worst case is. If you don't cap the resources Filebeat can use with max procs and the worker count, because you want to reach the maximum rate possible on the hardware, you risk performance problems on the server if the application ever goes crazy and logs an insane amount. Usually you also aim to protect the main application on the server, since it is more important than logging; only you know how many resources you want Filebeat to be able to use if things ever go sideways for whatever reason. Imagine Filebeat is not running, or is jammed, for a while... what happens when it unjams? Does it use all 32 cores on the machine and put pressure on the production workload on that server? That is usually not desired. Don't think about when things are OK; think about what will go terribly wrong, because someday it will.

I don't know if others have found better ways to monitor lateness; the only thing I came up with is to let Filebeat add its own "read" timestamp in addition to the original app timestamp. Then you pass all the events through an ingest pipeline on an Elasticsearch ingest node, and the pipeline adds an ingest timestamp. The event ends up with three timestamps: creation, read, and ingested. With those you can build dashboards and alarms to monitor lateness; you can even subtract the timestamps and add lateness timers to the event. It depends on how you want to track it. The main thing to understand is that you usually build your normal dashboards on the true event timestamp (creation), and those dashboards are ill equipped to detect and show lateness issues, because late events come in late but still insert themselves in the correct place on the timeline. Depending on when you check, you see something different, in the sense that even if it's 19:00, events from 18:00 are still coming in and the 18:00 to 18:05 bucket is still growing. Lateness handling and monitoring is a full subject on its own.
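To make the Filebeat side of that concrete, a sketch against the config you posted (not a drop-in, and assuming your Beats version behaves the same): Filebeat already stamps each event's @timestamp with the time it read the line, so the main thing is not to throw that away before it reaches Kafka; the ingested timestamp would then be added downstream, for example by a set processor in an Elasticsearch ingest pipeline.

```yaml
processors:
  - drop_fields:
      # "@timestamp" is the read time filebeat assigned when it harvested
      # the line; leaving it out of this list keeps it on the event (some
      # Beats versions refuse to drop @timestamp anyway, but don't rely on
      # that). The creation time stays inside the message text itself
      # (e.g. "I0524 16:32:40.761113") and can be parsed out downstream.
      fields: ["beat.name", "beat.version", "input_type", "@metadata", "metadata.type", "type", "offset"]
```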

I have (misconfigured) servers which sometimes start shipping me stuff they kept from 3 years ago... a nice index and shard explosion when indices are created automatically based on the true event timestamps. But that depends on deployment size; at some point it's more chaos management than anything else. Everything that can go wrong will, eventually, including logging storms. You don't want to bring anything down with logging; in most cases, at least, it's less important than the main production workload.


BTW, I tried pushing the change (adding worker: 3) to production, and it did not help.
From a quick look, it seems Filebeat's CPU usage got pushed up a bit to 96% (previously low 90s%), but the Kafka events handled in the last 30s are still around 200K.
When I get more time I will do more testing.
