Scalability issue: significant delay in shipping out log files, and the delay is dramatically different for different log files!

We use filebeat to tail log files on disk and ship them to Kafka. We observed a significant delay between when logs are written to disk and when they are processed and sent to Kafka. For example:
for READER-6.log, there is nearly a 45-second delay between the C++ log timestamp and when filebeat processes the lines and sends them to Kafka.
for READER-24.log and READER-34.log, the delay is ~38 minutes.
I understand that different log files are processed and sent to Kafka independently, but I am still surprised to see such a big discrepancy between files. Is this expected? And how can we reduce the discrepancy so that every file is processed at the same pace?

The left side of "|" is the filebeat timestamp, and the right side of "|" is the C++ log timestamp.
2018-08-20T23:00:57.680Z|I0820 23:00:12.257315 READER-6.log
2018-08-20T23:00:57.680Z|I0820 23:00:12.258847 READER-6.log
2018-08-20T23:00:57.680Z|I0820 23:00:12.266106 READER-6.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357524 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357533 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:20.357797 READER-34.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.410398 READER-24.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.411065 READER-24.log
2018-08-20T23:00:57.680Z|I0820 22:23:25.414383 READER-24.log

By the way, in case it is relevant, we limit filebeat to a single CPU core and nice it:

# Restrict the Go runtime to a single OS thread.
export GOMAXPROCS=1
nohup $BASEDIR/bin/filebeat -path.config $BASEDIR/genConfig -path.data $DATADIR -path.logs $BASEDIR/logs >/dev/null 2>/dev/null &
pid=$!
# Run filebeat at the lowest scheduling priority.
renice 19 -p $pid
# Pin filebeat to the highest-numbered CPU core ("CPU(s): N" -> core N-1).
# '(' is literal in a basic regex, so no -E is needed to match "CPU(s):".
cpu=$(lscpu | grep '^CPU(s):' | awk '{ print $2-1 }')
taskset -pc $cpu $pid
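
To confirm the pinning and priority actually took effect, standard Linux tooling works; for example:

taskset -pc $pid                    # prints the current CPU affinity list
ps -o pid,psr,ni,pcpu,comm -p $pid  # psr = core last run on, ni = nice value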

I would like to know: are we being too aggressive in limiting filebeat to a single CPU core and nicing it?
Between limiting CPU cores and nice, which one would cause more problems? And is there an easy way to debug how many cores are needed to keep up with the rate of log generation on disk, so that logs are shipped out in a timely manner, ideally within a minute if not a few seconds?

thanks,
yan

The amount of resources Filebeat requires will depend on the number of files monitored and the volume of data collected. Have you experienced issues with Filebeat resource usage in the past, given that you have limited it so severely? As it is niced, it will probably have difficulty keeping up whenever the host gets busy, especially if that correlates with larger volumes of logs being generated, but that may be the intention.
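
A cheap first check is whether filebeat is already saturating the one core it is allowed. A minimal sketch, assuming sysstat's pidstat is installed, the $pid from the startup script is still in scope, and a filebeat 5.x/6.x release (which logs a "Non-zero metrics" summary to its own log every 30 seconds by default):

# Sample filebeat's CPU usage every 5 seconds; ~100% of one core means
# it is compute-bound and the pinning is the bottleneck.
pidstat -p "$pid" 5
# filebeat's own periodic stats show published/acked event counts per
# 30-second window, i.e. how fast it is draining the files.
grep 'Non-zero metrics' "$BASEDIR/logs/filebeat" | tail -n 5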


@Christian_Dahlqvist Right, we understand that lagging behind depends on log volume; this happened when we recently started ramping up traffic (and therefore log volume).
I apologize if my last post was unclear. I would like to understand the following:
(1) Besides trial and error (gradually increasing the number of cores given to filebeat and observing), is there an easy way to determine how many cores filebeat needs to keep up with the current traffic?
(2) Why is there such a huge discrepancy between the shipping delays of different log files? I assume each log file is handled by a separate goroutine. Who is responsible for scheduling the different goroutines? I am concerned about unfairness in the Go scheduler.
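
For what it's worth on (2): filebeat runs one harvester goroutine per file, and with GOMAXPROCS=1 they all share a single OS thread. One way to watch the Go scheduler directly is the runtime's built-in schedtrace output; a minimal sketch reusing the startup line from above (the schedtrace.out filename is just an example):

# GODEBUG=schedtrace=1000 makes the Go runtime dump scheduler state to
# stderr every 1000 ms (thread count, idle procs, run queue lengths),
# so keep stderr instead of sending it to /dev/null.
export GOMAXPROCS=1
GODEBUG=schedtrace=1000 nohup $BASEDIR/bin/filebeat \
    -path.config $BASEDIR/genConfig -path.data $DATADIR -path.logs $BASEDIR/logs \
    >/dev/null 2>$BASEDIR/logs/schedtrace.out &
# A persistently long runqueue suggests harvester goroutines are queuing
# up behind the single core rather than being scheduled unfairly.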

thanks,
yan

What does your configuration look like? Are the volumes for the files suffering from differing delays similar?

Which configuration file? Do you mean the filebeat.yaml file?
As I posted above: the delay is 45 seconds for one file and ~38 minutes for the other two. These log files are generated in a similar way.

Yes.

I saw that, but it was not clear to me whether these files grow at roughly the same pace or not.

Yes, they grow at the same pace; the traffic is distributed randomly among the different readers.

Here is the filebeat.yaml file:
#=========================== Filebeat prospectors =============================
filebeat.prospectors:
- input_type: log

  paths:
    ${filebeat.log.paths}

  close_removed: true

  ignore_older: 60m
  close_inactive: 5m

  close_timeout: 70m
  clean_inactive: 48h
  clean_removed: true
  symlinks: true

  multiline.pattern: '^[[:digit:]]{4,}|^[I|W|E|F|D|T][[:digit:]]{4,}'
  multiline.negate: true
  multiline.match: after

processors:
- drop_fields:
    fields: ["beat.name", "beat.version", "input_type", "@timestamp", "type", "offset"]

#-------------------------- Kafka output ------------------------------
output.kafka:
  hosts: ["${kafka.logging.metadata.tls.broker.list}"]

  topic: '${filebeat.log.kafka.topic}'
  partition.hash:
    hash: ['beat.hostname']
    random: true # if false non-hashable events will be dropped

  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000
  ssl.certificate_authorities: ["${kafka.ssl.ca}"]
  ssl.certificate: "${kafka.ssl.certificate}"
  ssl.key: "${kafka.ssl.key}"
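
Since the delays differ per file, it may also help to quantify the per-file backlog straight from filebeat's registry. A minimal sketch, assuming the 5.x/6.x registry layout (a JSON array at $DATADIR/registry with a source path and offset per file) and that jq is installed:

REGISTRY="$DATADIR/registry"
# For every file filebeat tracks, compare its current size with the offset
# filebeat has already shipped; the difference is the unshipped backlog.
jq -r '.[] | "\(.source) \(.offset)"' "$REGISTRY" | while read -r file offset; do
    [ -f "$file" ] || continue
    size=$(stat -c %s "$file")
    echo "$file backlog=$((size - offset)) bytes"
done

Running this a few times a minute apart shows which files' backlogs keep growing, i.e. which ones the single core cannot drain.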
