Filebeat is losing events due to file rotate

Hi!

I'm opening this topic before opening an issue on github.
I hope I haven't made any mistakes, because it took me quite a long time to test this and to report it here.
Thanks in advance for your kind help.

Regards

Scenario

  • Logstash event generator writing to file "generator.log"
  • FilebeatA reading from "generator.log" and writing to "filebeat.log", rotating files every 1 MB
  • FilebeatB reading from "filebeat.log" and writing to "final.log", rotating every 3 MB (rotation on this file can actually be disabled, but the problem persists)

Please, notice that:

  • If rotation renames the file by appending .X (e.g. file.X) and FilebeatB is listening to file*, most events will be re-processed (i.e. processed twice)
  • If rotation creates a new file named with a timestamp (e.g. file.YYYY.MM.dd.hh.mm.ss) and FilebeatB is listening to file*, most events will be re-processed (i.e. processed twice)
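To make the rename case concrete, here is a minimal sketch (file names are illustrative) of what rename-style rotation looks like on disk: the rotated file keeps its contents but gains a new name that still matches the file* glob, so a reader watching that glob sees it as a candidate again.

```shell
# Simulate rename-based rotation of a log file.
echo "event-1" > /tmp/rot_demo.log
mv /tmp/rot_demo.log /tmp/rot_demo.log.1   # rotation: same data, new name
: > /tmp/rot_demo.log                      # fresh empty file under the old name
ls /tmp/rot_demo.log*                      # both names now match the glob
```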

Result

FilebeatB seems to be losing events registered by FilebeatA due to rotation.
Events present in filebeat.log.X don't exist in final.log*.

It seems FilebeatA rotates the file before FilebeatB can read its last events.

Details

  • logstash 5.0.0-alpha2 (PS: docker pull logstash:5.0.0-alpha3-1 -> Tag 5.0.0-alpha3-1 not found in repository docker.io/library/logstash :slight_smile:)
  • filebeat version 1.2.3 (amd64)

Configuration files

logstash.yml

input {
    generator{
        add_field => { "from" => "${HOSTNAME}" }
    }
}
filter {}
output {
    stdout {}
    file {
        path => "/tmp/generator.log"
        flush_interval => 0
    }
}

filebeatA.yml

filebeat:
  prospectors:
    -
      paths:
        - /tmp/log*
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: filebeat.log
    rotate_every_kb: 1000
    number_of_files: 100
  console:
shipper:
  name: filebeatA
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info

filebeatB.yml

filebeat:
  prospectors:
    -
      paths:
        - /tmp/filebeat.log
      input_type: log
      scan_frequency: 0s
output:
  file:
    path: "/tmp"
    filename: final.log
    rotate_every_kb: 3000
    number_of_files: 10
  console:
shipper:
  name: filebeatB
  tags: ["service-X", "web-tier"]
logging:
  to_syslog: false
  to_files: false
  level: info

Verification

This command checks for gaps in the sequence numbers generated by the Logstash generator:
ls -v filebeat.log* | xargs -i grep generator '{}' | sed 's/.*sequence\\\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

This command checks for gaps in the sequence numbers received from FilebeatA:
ls -v final.log* | xargs -i grep generator '{}' | sed 's/.*sequence\\\\\\"://g' | sed 's/,.*//g' | sort -n | awk -v "p=1" '(p==1){p=$1-1}($1!=p+1){print p+1"-"$1-1}{p=$1}END{print "ok"}'

If the last command above prints at least one pair of numbers, you can check that those sequences exist in _filebeat_ but don't exist in _final_.

For example, if 344356-352529 is printed:
grep -H generator filebeat.log* | grep 344356 prints one line.
grep -H generator final.log* | grep 344356 doesn't print anything.

In fact, querying for the previous sequence value (344355) shows that the filebeat.log.X filename (from grep -H) is different: the file was cut off right at that point.
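The gap-detection logic in the commands above can be exercised on its own with synthetic data. This sketch feeds the sequence 1,2,3,7,8,9,10 through the same kind of awk pass; the missing range 4-6 is reported, followed by "ok".

```shell
# Self-contained version of the gap check: report any missing ranges
# in a sorted stream of sequence numbers.
printf '%s\n' 1 2 3 7 8 9 10 \
  | sort -n \
  | awk '(NR==1){p=$1} (NR>1 && $1!=p+1){print p+1"-"$1-1} {p=$1} END{print "ok"}'
# prints:
#   4-6
#   ok
```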

Proposed solution

  • Each FilebeatA output file is created with a creation timestamp in its name (e.g. filebeat.log.YY-MM-dd-hh-mm-ss), so FilebeatB will be able to read it until rotation happens.
  • At the moment of rotation, a new file is created with its own creation timestamp, but the previous one still exists to be processed by FilebeatB, which will close it after a certain timeout.
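A rough sketch of what this proposal would look like on disk (the rotate function and file names are hypothetical, not a real Filebeat option): each rotation only creates a new timestamped file, leaving the previous one untouched so a downstream reader can drain it.

```shell
#!/bin/sh
# Hypothetical timestamp-based rotation: never rename or delete the old file.
dir=/tmp/rotate_demo
mkdir -p "$dir"
rotate() {
  ts=$(date +%Y-%m-%d-%H-%M-%S)
  current="$dir/filebeat.log.$ts"
  : > "$current"
}
rotate
echo "event before rotation" >> "$current"
sleep 1                      # ensure a distinct timestamp for the demo
rotate                       # old file stays in place, readable by FilebeatB
ls "$dir"/filebeat.log.*     # both files remain visible
```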

@mostolog Thanks a lot for the effort to write down all the details. This helps a lot in investigating and reproducing. I will have a closer look later.

One thing I spotted that worried me is that you use scan_frequency: 0. This should be at least 1s, as 0 can lead to very high CPU usage and unexpected behaviour. Could you retry it with 1s? It would also be interesting to see whether your issue persists in 5.0.0-alpha3.

I have modified scan_frequency to 1s in both FilebeatA and FilebeatB. As the documentation doesn't say anything about "unexpected behaviour", should I open a bug to fix that?

Same behaviour with 5.0.0-alpha3-1: FilebeatB is missing events.

Is there any other test I could do? Should I file the issue now?

Just created Issue #5503

I'm currently looking into it and will let you know when I find more details.

Note about the Docker images: be aware that these are not official images from Elastic.

This topic was automatically closed after 21 days. New replies are no longer allowed.