Scalability issue: significant delay in shipping out log files, and the delay is dramatically different for different log files!

We use filebeat to tail log files on disk and ship them to Kafka. We observed a significant delay between when logs are written to disk and when they are processed and sent to Kafka. For example:
for READER-6.log, there is nearly a 45-second delay between the C++ log timestamp and when filebeat processes the lines and sends them to Kafka.
for READER-24.log and READER-34.log, the delay is ~38 minutes.
I understand that different log files are processed and sent to Kafka independently, but I am still surprised to see such a big discrepancy between files. Is this expected? And how can we reduce the discrepancy so that every file is processed at the same pace?

The left side of "|" is the filebeat timestamp, and the right side of "|" is the C++ log timestamp.
2018-08-20T23:00:57.680Z|I0820 23:00:12.257315 READER-6.log
2018-08-20T23:00:57.680Z|I0820 23:00:12.258847 READER-6.log
2018-08-20T23:00:57.680Z|I0820 23:00:12.266106 READER-6.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357524 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357533 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357797 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.410398 READER-24.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.411065 READER-24.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.414383 READER-24.log

By the way, in case it is relevant, we limit filebeat to a single CPU core and nice it:

# Restrict the Go runtime to a single OS thread.
export GOMAXPROCS=1
nohup $BASEDIR/bin/filebeat -path.config $BASEDIR/genConfig -path.data $DATADIR -path.logs $BASEDIR/logs >/dev/null 2>/dev/null &
pid=$!
# Run filebeat at the lowest scheduling priority.
renice 19 -p $pid
# Pin filebeat to the highest-numbered CPU core ("CPU(s): N" -> core N-1).
# '(' is literal in a basic regex, so no -E is needed to match "CPU(s):".
cpu=$(lscpu | grep '^CPU(s):' | awk '{ print $2-1 }')
taskset -pc $cpu $pid
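
To confirm the pinning and priority actually took effect, standard Linux tooling works; for example:

taskset -pc $pid                    # prints the current CPU affinity list
ps -o pid,psr,ni,pcpu,comm -p $pid  # psr = core last run on, ni = nice value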

I would like to know: are we being too aggressive in limiting filebeat to a single CPU core and nicing it?
Between limiting CPU cores and nice, which one would cause more problems? And is there an easy way to debug how many cores are needed to keep up with the rate of log generation on disk, so that logs are shipped out in a timely manner, ideally within a minute if not a few seconds?

thanks,
yan

The amount of resources Filebeat requires will depend on the number of files monitored and the volume of data collected. Have you experienced issues with Filebeat resource usage in the past, given that you have limited it so severely? As it is niced, it will probably have difficulty keeping up whenever the host gets busy, especially if that correlates with larger volumes of logs being generated, but that may be the intention.
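
A cheap first check is whether filebeat is already saturating the one core it is allowed. A minimal sketch, assuming sysstat's pidstat is installed, the $pid from the startup script is still in scope, and a filebeat 5.x/6.x release (which logs a "Non-zero metrics" summary to its own log every 30 seconds by default):

# Sample filebeat's CPU usage every 5 seconds; ~100% of one core means
# it is compute-bound and the pinning is the bottleneck.
pidstat -p "$pid" 5
# filebeat's own periodic stats show published/acked event counts per
# 30-second window, i.e. how fast it is draining the files.
grep 'Non-zero metrics' "$BASEDIR/logs/filebeat" | tail -n 5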


@Christian_Dahlqvist Right, we understand that lagging behind depends on log volume; this happened when we recently started ramping up traffic (and therefore log volume).
I apologize if my last post was unclear. I would like to understand the following:
(1) Besides trial and error (gradually increasing the number of cores given to filebeat and observing), is there an easy way to determine how many cores filebeat needs to keep up with the current traffic?
(2) Why is there such a huge discrepancy between the shipping delays of different log files? I assume each log file is handled by a separate goroutine. Who is responsible for scheduling the different goroutines? I am concerned about unfairness in the Go scheduler.
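
For what it's worth on (2): filebeat runs one harvester goroutine per file, and with GOMAXPROCS=1 they all share a single OS thread. One way to watch the Go scheduler directly is the runtime's built-in schedtrace output; a minimal sketch reusing the startup line from above (the schedtrace.out filename is just an example):

# GODEBUG=schedtrace=1000 makes the Go runtime dump scheduler state to
# stderr every 1000 ms (thread count, idle procs, run queue lengths),
# so keep stderr instead of sending it to /dev/null.
export GOMAXPROCS=1
GODEBUG=schedtrace=1000 nohup $BASEDIR/bin/filebeat \
    -path.config $BASEDIR/genConfig -path.data $DATADIR -path.logs $BASEDIR/logs \
    >/dev/null 2>$BASEDIR/logs/schedtrace.out &
# A persistently long runqueue suggests harvester goroutines are queuing
# up behind the single core rather than being scheduled unfairly.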

thanks,
yan

What does your configuration look like? Are the volumes for the files suffering from differing delays similar?

Which configuration file? Do you mean the filebeat.yaml file?
As I posted above: the delay is 45 seconds for one file and ~38 minutes for the other two. These log files are generated in a similar way.

Yes.

I saw that, but it was not clear to me whether these files grow at roughly the same pace or not.

Yes, they grow at the same pace; the traffic is distributed randomly among the different readers.

Here is the filebeat.yaml file:
#=========================== Filebeat prospectors =============================
filebeat.prospectors:
- input_type: log

  paths:
    ${filebeat.log.paths}

  close_removed: true

  ignore_older: 60m
  close_inactive: 5m

  close_timeout: 70m
  clean_inactive: 48h
  clean_removed: true
  symlinks: true

  multiline.pattern: '^[[:digit:]]{4,}|^[I|W|E|F|D|T][[:digit:]]{4,}'
  multiline.negate: true
  multiline.match: after

processors:
- drop_fields:
    fields: ["beat.name", "beat.version", "input_type", "@timestamp", "type", "offset"]

#-------------------------- Kafka output ------------------------------
output.kafka:
  hosts: ["${kafka.logging.metadata.tls.broker.list}"]

  topic: '${filebeat.log.kafka.topic}'
  partition.hash:
    hash: ['beat.hostname']
    random: true # if false non-hashable events will be dropped

  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
  ssl.certificate_authorities: ["${kafka.ssl.ca}"]
  ssl.certificate: "${kafka.ssl.certificate}"
  ssl.key: "${kafka.ssl.key}"
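
Since the delays differ per file, it may also help to quantify the per-file backlog straight from filebeat's registry. A minimal sketch, assuming the 5.x/6.x registry layout (a JSON array at $DATADIR/registry with a source path and offset per file) and that jq is installed:

REGISTRY="$DATADIR/registry"
# For every file filebeat tracks, compare its current size with the offset
# filebeat has already shipped; the difference is the unshipped backlog.
jq -r '.[] | "\(.source) \(.offset)"' "$REGISTRY" | while read -r file offset; do
    [ -f "$file" ] || continue
    size=$(stat -c %s "$file")
    echo "$file backlog=$((size - offset)) bytes"
done

Running this a few times a minute apart shows which files' backlogs keep growing, i.e. which ones the single core cannot drain.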
