Filestream input sends duplicates events on restart and during operation

@michaelbu thanks once more for all the information!

Yes, you can send me a private message here on Discuss and we can find the best way to send me those logs, or even how to collect the logs without the sensitive data.

Do you think collecting Filebeat's own logs with the journald input instead of filestream is a good workaround for this? If so, is there a way to get this output in JSON?

I don't think there is much difference. The journald input supports processors and there is a decode JSON fields processor you can use.
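For reference, here is a minimal sketch of what that could look like. The input id and the exact processor settings are assumptions you would need to adapt to your setup and Filebeat version:

# Sketch (untested): read journald entries and decode the JSON payload stored
# in the message field. Check the journald input and decode_json_fields docs
# for your Filebeat version before using it.
filebeat.inputs:
- type: journald
  id: filebeat-own-logs            # hypothetical id
  processors:
    - decode_json_fields:
        fields: ["message"]        # field containing the JSON document
        target: ""                 # merge the decoded keys into the event root
        overwrite_keys: true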

In my understanding, something makes the registry break, and if this happens it will not be repaired and the logfile will be read over and over again until it's closed?

If the data (e.g. the cursor) is inconsistent in the registry, when Filebeat starts again it will read from the offset recorded there. If the cursor is smaller than the current file size, then Filebeat will start reading from that point; if the offset is bigger than the file size, then Filebeat will interpret it as a truncation.
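In code, the decision boils down to something like this simplified, runnable sketch (the real check lives in filestream/input.go and is quoted further down in this thread; the path and offset in main are just hypothetical values):

package main

import (
	"fmt"
	"os"
)

// resumeOffset returns the offset reading should resume from: the registry
// offset if the file is at least that big, otherwise 0, i.e. the file is
// treated as truncated.
func resumeOffset(path string, registryOffset int64) (int64, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	if fi.Size() < registryOffset {
		fmt.Printf("treating %s as truncated (size=%d < offset=%d)\n",
			path, fi.Size(), registryOffset)
		return 0, nil
	}
	return registryOffset, nil
}

func main() {
	off, err := resumeOffset("/var/log/example.log", 1024) // hypothetical values
	fmt.Println("resume offset:", off, "err:", err)
}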

That is really puzzling me because Filebeat does not truncate its log files. I've been reviewing our log rotation code today, and the way it works in Beats, when rotation is needed, is:

  • The current file is closed
  • The new name is generated (the default format is: filebeat-20230612-16.ndjson, the 16 is an ever-increasing counter)
  • If needed, old files are deleted
  • The new file is created/opened.

This means Filebeat will not truncate its own log files.

I still need to investigate more, but today I managed to reproduce this ever-increasing cursor in the registry in a very peculiar situation:

  • Filebeat harvesting its own logs
  • I truncate the logs using truncate --size=0 /path/to/file
  • Most of the time, the cursor in the registry keeps increasing, however:
    • stat path/to/file shows the new size as zero for a brief period of time, then it jumps back to what was in the registry
    • wc -l confirms that the file has been truncated and correctly shows the line count: 0 after truncation, then slowly increasing.

I'm still investigating.

@michaelbu I've been working on this for the last couple of days, here is the update I have.

  1. It is possible that a truncated file does not get smaller.
    I managed to reproduce it by truncating (truncate --size=0
    path/to/file) Filebeat's own log file while Filebeat was still
    running. stat or ls will show that the file size does not
    shrink; however, for a second or so they might show it as 0. I
    did not dig deeper into why this happens, but I suspect that
    because Filebeat still holds a file descriptor to write to that
    file, when Filebeat writes after the truncation, the metadata
    gets "reset" to the offset of that file descriptor (see the Go
    experiment after this list). I also tested this on RHEL 7 with
    xfs and the behaviour was the same.
  2. In my final tests, Filebeat was always reading and saving in the
    registry an offset compatible with what was returned by stat.
  3. Given #1 and #2, it is possible that on a restart of Filebeat, the
    OS/file system will "fix" the metadata and when Filebeat opens the
    file, it will detect the truncation and start reading the file from
    the beginning. It is also possible that between Filebeat closing
    and opening the file again, the OS/file system will also "fix" the
    metadata, leading Filebeat to detect a truncation and start reading
    the file from the beginning again.
  4. Carefully looking at the logs you provided I can see an interesting
    pattern:
    1. Filebeat is reading the file
    2. It reaches EOF
    3. At some point it closes the file
    4. The file gets updated
    5. Filebeat opens the file, its size is smaller than what is in the
      registry, it's a truncation, set offset to 0.
      This suggests that the metadata Filebeat is receiving from the
      OS is changing between the file being closed and opened
      again (see point #3).
  5. Regarding Filebeat's own log file, it gets even weirder. The
    file is first moved, then a little later, when Filebeat re-opens
    it for reading, it detects a truncation. Those are operations
    Filebeat does not perform on its own log files.
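To illustrate point #1, here is a minimal Go experiment reproducing the "size jumps back" behaviour with a plain, non-append file descriptor. This is my own sketch of the mechanism I suspect, not Filebeat's logging code, and the path is just a scratch file for the experiment:

package main

import (
	"fmt"
	"os"
)

func sizeOf(path string) int64 {
	fi, err := os.Stat(path)
	if err != nil {
		panic(err)
	}
	return fi.Size()
}

func main() {
	path := "/tmp/truncate-demo.log" // scratch file used only for the experiment

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Write ~1 MiB so the descriptor's offset is far from zero.
	if _, err := f.Write(make([]byte, 1<<20)); err != nil {
		panic(err)
	}
	fmt.Println("size before truncation:", sizeOf(path)) // ~1 MiB

	// Simulate the external `truncate --size=0` while the writer keeps its fd.
	if err := os.Truncate(path, 0); err != nil {
		panic(err)
	}
	fmt.Println("size right after truncation:", sizeOf(path)) // 0

	// The next write lands at the descriptor's old offset, so the reported
	// size jumps back (the file becomes sparse up to that point).
	if _, err := f.Write([]byte("new data\n")); err != nil {
		panic(err)
	}
	fmt.Println("size after one more write:", sizeOf(path)) // back to ~1 MiB + 9
}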

It really seems to me that something external is truncating those
files, which can lead to this metadata issue I described above.

Have you managed to reproduce it on a controlled environment? E.g:
Without using Graylog and with full control of the log that Filebeat
is reporting as truncated?

An easy option to generate logs is a script like this:

#!/bin/sh

i=0
while true
do
    i=$((i+1))   # POSIX arithmetic; `let i++` is a bash-ism and may fail under plain sh
    echo "$(date) - $i" >> /tmp/log.log
    sleep 1
done

The issue I mentioned earlier is not exactly your case (the code paths
are slightly different); here is the link to the PR:
Fix filestream flaky truncated file test by belimawr · Pull Request #35759 · elastic/beats · GitHub.

Hi @TiagoQueiroz,

thank you for all the efforts to reproduce this!

It really seems to me that something external is truncating those
files, which can lead to this metadata issue I described above

Can you provide a Go code snippet to detect the truncation?
In my understanding, we can isolate the problem if this little program also detects the wrong metadata with fi.Size()?
It could run on the affected node in parallel, without Graylog.

Is fi.Size() part of the standard Go library?

I wonder why the old log input does not detect the truncations?

Btw, we will update the Filebeat version next Monday from 8.7.0 to 8.8.1.

Hi @TiagoQueiroz,

That is really puzzling me because Filebeat does not truncate its log files. I've been reviewing our log rotation code today, and the way it works in Beats, when rotation is needed, is

Of course you're right: I understand why Filebeat logs its own logs as truncated in some of my tests.
This does not affect the debug logs that I've sent to you, which were created in production.

This happens in my tests when operating Filebeat at the command line with these steps:

  • Stop Filebeat, which has collected its own logs
  • Delete the old logfiles of Filebeat
  • Start Filebeat
  • Filebeat writes the logfile filebeat-YYYYMMDD.ndjson, which it had already read before the start
  • Because the logfile is new, it sees this as truncated

Yes, fi is an os.FileInfo, the value returned by os.Stat.

Your script should be able to detect the truncation as long as it is quick enough to catch the moment the file size is 0. How quickly new data is being added to the log file will play a major role here. I managed to easily reproduce it with data being added only every few seconds.

I'd recommend using stat -c %s <file path> instead of ls to get the file size.

Any program/script that reads/tracks the file size is going to work, it just needs to be "fast enough" to detect the size dropping and then going back up, if your case is the one I managed to reproduce and described in 'Filestream input sends duplicates events on restart and during operation - #22 by TiagoQueiroz'.
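Since you asked for a Go snippet: below is a minimal polling sketch that reports whenever the size returned by os.Stat (the same fi.Size() we discussed) drops below the previously observed value. The 100ms poll interval is an assumption; it only needs to be short enough to catch the brief window where the size reads smaller. You would run it alongside Filebeat, e.g. as a hypothetical binary truncwatch /path/to/log.

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: truncwatch <path>")
		os.Exit(1)
	}
	path := os.Args[1]
	lastSize := int64(-1)

	for {
		fi, err := os.Stat(path)
		switch {
		case err != nil:
			// The file may briefly not exist during a rotation.
			fmt.Println(time.Now().Format(time.RFC3339Nano), "stat error:", err)
		case lastSize >= 0 && fi.Size() < lastSize:
			fmt.Printf("%s truncation detected: size dropped from %d to %d\n",
				time.Now().Format(time.RFC3339Nano), lastSize, fi.Size())
			lastSize = fi.Size()
		default:
			lastSize = fi.Size()
		}
		time.Sleep(100 * time.Millisecond) // poll interval: an assumption, tune as needed
	}
}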

If you're using the default file identity (as your configuration shows), that should not happen. Filebeat identifies files by inode + device ID, so even if a file is renamed, Filebeat will keep the correct state. If a file is deleted and a new one with the same name/path is created, Filebeat will correctly identify it as a new file (different inode) and start reading from the beginning.
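If it helps, here is a small Go helper (a Linux-only sketch of mine, not something shipped with Filebeat) that prints the inode, device ID and size for any paths you pass it, so you can compare the "old" and "new" files the same way stat would:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	for _, path := range os.Args[1:] {
		fi, err := os.Stat(path)
		if err != nil {
			fmt.Println(path, "stat error:", err)
			continue
		}
		// On Linux, FileInfo.Sys() exposes the underlying syscall.Stat_t,
		// which carries the inode and device ID.
		st, ok := fi.Sys().(*syscall.Stat_t)
		if !ok {
			fmt.Println(path, "no syscall.Stat_t available on this platform")
			continue
		}
		fmt.Printf("%s inode=%d device=%d size=%d\n", path, st.Ino, st.Dev, fi.Size())
	}
}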

Did you really see the log lines stating the file was truncated in this test? I tried it, but could not reproduce it. If so, could you use stat on the "old" and "new" log files and check their inodes and device IDs?

I can only think of a small corner case where this might happen: if the file system is re-using inodes, the new log file could end up with the same inode as the deleted one, hence Filebeat treats them as the same file.

However, this should not be an issue for collecting Filebeat logs: Filebeat rotates its own log files and does not re-use filenames, so you should experience no issue if the files are not manually deleted.

We have a task in progress to implement a file identifier that does not rely on the file system: [Filebeat] Implement a more stable file identifier · Issue #34419 · elastic/beats · GitHub

Hi @TiagoQueiroz,

thank you very much for all your thoughts and information.

Have you managed to reproduce it on a controlled environment? E.g:
Without using Graylog and with full control of the log that Filebeat
is reporting as truncated?

It took a while, but I managed to reproduce it without Graylog. So Sidecar is not part of the problem.
I also updated to the latest Filebeat version, 8.8.1.

I've started Filebeat at the command line like:

nohup /usr/share/filebeat/bin/filebeat -c /tmp/test-filebeat/filebeat.conf --path.home /usr/share/filebeat &

After seeing that Filebeat logged truncations, I stopped it.

Now there is a strange behaviour:
Starting it with debug enabled, it logs during the start phase that files have been truncated.

I would like to share the following information with you. Maybe you have an idea of what's going wrong? I expect Filebeat to correct the wrong value of the cursor in the registry after the fresh start.

  • The script g that starts Filebeat
  • Configuration file filebeat.conf
  • The registry in directory data
  • The logs in log-prod
  • The stats of the logfiles

I'll send you the logs by mail.

Thanks for your patience!
The next step is trying to reproduce it with an artificial log as you have suggested.

Regards
Michael


That is normal. There are a few different places where Filebeat will log about truncation; one is when opening the file (which always happens when Filebeat starts, but also happens during normal operation if the file has been closed). In that case the log line looks like this:

{"log.level":"info","@timestamp":"2023-05-24T12:42:42.576+0200","log.logger":"input.filestream","log.origin":{"file.name":"filestream/input.go","file.line":267},"message":"File was truncated. Reading file from offset 0. Path=/tmp/foo/bar.log.log","service.name":"filebeat","id":"foo-bar","source_file":"filestream::foo-bar::native::12345678-12345","path":"/tmp/foo/bar.log.log","state-id":"native::12345678-12345","ecs.version":"1.6.0"}

This means that the stat information Filebeat got when opening the file contains a size that is smaller than the offset in the registry.

I looked at the data you sent by email; I still can't find anything out of the ordinary there.

Once more I looked at the code and did some testing; however, I cannot find any issues. The only thing I can see is Filebeat reacting to the information it is receiving from the OS/file system.

If you manage to do that, then I'll be able to reproduce the issue in my dev environment, and if there is a bug, I'll be able to spot and fix it.

If you don't manage to reproduce it in a controlled environment, then the only explanation I can find is that something in your environment is messing with those files.

That is normal. There are a few different places where Filebeat will log about truncation; one is when opening the file (which always happens when Filebeat starts, but also happens during normal operation if the file has been closed). In that case the log line looks like this:

I'm afraid this is a misunderstanding. I observed this behaviour:

  • I started Filebeat with nohup and let it read the logs on the production system.
  • Once it logs the truncations, I stop it.
  • When I start it again, it logs the truncations at start time and then repeatedly at the 10-minute interval.
  • To check what the file size and the cursor value are, I modified the code like this:
filebeat$ git diff
diff --git a/filebeat/input/filestream/input.go b/filebeat/input/filestream/input.go
index 2bd53c0e4d..a1ab0315fd 100644
--- a/filebeat/input/filestream/input.go
+++ b/filebeat/input/filestream/input.go
@@ -264,7 +264,7 @@ func (inp *filestream) openFile(log *logp.Logger, path string, offset int64) (*o
        }

        if fi.Size() < offset {
-               log.Infof("File was truncated. Reading file from offset 0. Path=%s", path)
+               log.Infof("File was truncated. Reading file from offset 0. Path=%s (fi.Size=%v, offset=%v)", path, fi.Size(), offset)
                offset = 0
        }
        err = inp.initFileOffset(f, offset)
  • I restarted Filebeat once again and see these logs:
-bash-4.2$ cat log-prod/filebeat-20230623.ndjson | jq -r '. | "\(."@timestamp") \(."log.level") \(.message)"' | grep trunc | grep XXX
2023-06-23T08:31:54.688+0200 info File was truncated. Reading file from offset 0. Path=/var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log (fi.Size=2466303, offset=99318094)
2023-06-23T08:45:06.700+0200 info File was truncated. Reading file from offset 0. Path=/var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log (fi.Size=2528244, offset=101804783)
2023-06-23T09:05:06.696+0200 info File was truncated. Reading file from offset 0. Path=/var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log (fi.Size=2610962, offset=104374190)
2023-06-23T09:25:06.701+0200 info File was truncated. Reading file from offset 0. Path=/var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log (fi.Size=2693680, offset=107026707)
2023-06-23T09:40:06.699+0200 info File was truncated. Reading file from offset 0. Path=/var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log (fi.Size=2755621, offset=109741165)

In my understanding this happened:

  • "Something" might have corrupted the metadata of the file, and we do not know what the cause of that is.
  • After restarting Filebeat, it should reread the whole log and update the cursor, but it seems to me that the cursor value will never be set to the right value?
  • In the logs you can see that fi.Size() and the cursor grow, but I expect the cursor to be set to the right value?
  • I wonder if Filebeat handles the initial error the wrong way?

Btw, I'm sure that the production logfile is written to every 5 minutes.

Do you think I'm wrong?

If Filebeat logs a truncation, I expect to see the cursor value become smaller once the file has been reread?

$ cat data/registry/filebeat/log.json | grep /var/db2/auditlog/XXX/db2audit_extract_db-23-06-23.log | jq -c '.v' | jq -r '. | "\(."updated") \(."cursor")"'
[516032564365,1687384517] {"offset":6126124}
[515973095982,1687471506] {"offset":6182979}
[515974081884,1687471511] {"offset":6183361}
[516023829415,1687471810] {"offset":6211005}
[516024789221,1687471815] {"offset":6211397}
[516024792767,1687472408] {"offset":6266694}
[516050609850,1687472408] {"offset":6352369}
[516050737377,1687472413] {"offset":6352761}
[516193662436,1687472712] {"offset":6391375}
[516193836766,1687472717] {"offset":6391767}
[516233760276,1687473006] {"offset":6420195}
[516235719806,1687473011] {"offset":6420586}
[516087666420,1687473606] {"offset":6685455}
[516088806721,1687473611] {"offset":6685847}
[516137415198,1687473910] {"offset":6714274}
[516137905925,1687473915] {"offset":6714666}
[515989559184,1687474506] {"offset":7064435}
[515990489694,1687474511] {"offset":7064827}
[516019744025,1687474810] {"offset":7093637}
[516020474857,1687474815] {"offset":7094028}
[516065816238,1687475104] {"offset":7122064}
[516071516034,1687475109] {"offset":7122446}
[515978785913,1687475706] {"offset":7256008}
[516075623389,1687475706] {"offset":7520101}
[516113548856,1687475706] {"offset":7586689}
[516113759955,1687475711] {"offset":7587081}
[516150867612,1687476010] {"offset":7614343}
[516151232994,1687476015] {"offset":7614735}
[515481042786,1687476609] {"offset":7869488}
[515886076340,1687476609] {"offset":8144795}
[515921480475,1687476609] {"offset":8163487}
[515934061796,1687476614] {"offset":8163879}
[516281317264,1687477206] {"offset":8586071}
[516394822919,1687477206] {"offset":8771070}
[516394876293,1687477211] {"offset":8771462}
[515440390288,1687477511] {"offset":8800653}
[515440496935,1687477516] {"offset":8801045}
[515479954535,1687477805] {"offset":8828307}
[515481191581,1687477810] {"offset":8828699}
[516128052685,1687478406] {"offset":9062406}
[516316018961,1687478406] {"offset":9550364}
[516317840460,1687478411] {"offset":9550756}
[516370435904,1687478710] {"offset":9578791}
[516371281970,1687478715] {"offset":9579183}
[515996720966,1687479306] {"offset":9993199}
[516052195084,1687479306] {"offset":10385748}
[516053156722,1687479311] {"offset":10386140}
[516095855488,1687479610] {"offset":10414176}
[516096629906,1687479615] {"offset":10414567}
[516119828478,1687480206] {"offset":10859745}
[516380004274,1687480206] {"offset":11275726}
[515437581820,1687480207] {"offset":11316220}
[515440514468,1687480212] {"offset":11316612}
[515489328527,1687480511] {"offset":11345029}

Isn't it suspicious that the cursor is constantly growing?

Do you think it's possible that Filebeat does not properly handle that "something in our environment is messing with those files"?

I understand that you are not able to fix something that only occurs in our environment, so I'm trying to reproduce it. I worry that I will not manage to reproduce it and we will be forced to use the deprecated log input permanently.

Thanks for all your efforts and your patience!

Kind regards
Michael

I would phrase it differently: Filebeat is responding to events/environment changes the way it is supposed to (at least based on what I could gather so far).

However, there is something happening to those log files that neither you nor I expect, and it leads Filebeat to behave in a way we do not expect.

If the log input works for you, go for it; it is deprecated, but it won't be removed in the 8.x series, so there is time to find a solution without disrupting your production environment.

Without being able to reproduce it in an isolated environment/experiment, I won't be able to identify whether it's a bug in how Filebeat is reading information from the file system or whether something external is affecting the log files in a way we do not want. Given you're not managing to reproduce it, I believe the latter is the likely reason for the issues you're facing.

Thanks for all your patience and hard work helping out on this investigation! I'll gladly jump back in whenever there is new information.

It depends. I managed to get this exact behaviour while truncating Filebeat's log file while it was also being harvested by the same Filebeat process. Even stat or ls would show the file size ever increasing, with a split-second drop to 0 bytes. But once Filebeat wrote to the file again, the size would jump back to what it was before the truncation plus the new data. It really seems to be a file system behaviour (I also tested it with xfs): if a process holds the file handle while the file is externally truncated, the next write will "bring" the size back to what it was; however, the data in the file is "only" the new data.

One thing I did not test was to stop all processes holding handles to that file and wait for a while to see if the size would change again.

In all tests I performed, the offset in the registry always reflected the same information returned by stat or ls. So Filebeat behaved as expected.

It is also important to remember that the registry updates on disk happen once every second (default configuration), so you might not see some intermediate states there.
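If you ever want to capture more of those intermediate states, the flush interval can be lowered in the configuration. A sketch; please double-check the setting and its default against the registry documentation for your Filebeat version:

# Sketch: persist registry updates more often than the default (assumed 1s).
filebeat.registry.flush: 200ms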

If you want to read the timestamps in the registry, you can use this application: GitHub - belimawr/registry-decoder: Filebeat registry decoder/pretty-printer. I wrote it while investigating some issues. It works well, but the code is not the most beautiful thing as it was a quick hack :see_no_evil:

However, there is something happening to those log files that neither you nor I expect, and it leads Filebeat to behave in a way we do not expect.

Exactly! :slight_smile:

I've managed to trigger a truncation with an artificial log. This is the setup:

This script creates a logfile:

i=0
while true; do
    let i++
    for j in `eval echo {1..10}`; do
        echo "$(date) - $i - $j"  >> /tmp/test-filebeat/test.log
    done
    sleep 300
done

A second script rotates the log:

#!/bin/bash
LOG=/tmp/test-filebeat/test.log
while true; do
    SIZE=$(stat -c%s "${LOG}")
    if [ ${SIZE} -gt 2048 ]; then
        sleep 3
        mv -f ${LOG} ${LOG}.1
        touch ${LOG}
    fi
    sleep 4
done

The filebeat configuration:

logging:
  level: info
  files:
    keepfiles: 2
  metrics:
    enabled: false

output.logstash:
  hosts: ["XXX:5046", "XXX.lvm.de:5046", "XXX.lvm.de:5046"]
  loadbalance: true
  ssl.verification_mode: full

path:
  data: /tmp/test-filebeat/data
  logs: /tmp/test-filebeat/log
  config : /etc/filebeat

filebeat.inputs:
- enabled: true
  id: test.log
  type: filestream
  encoding: plain
  paths:
  - /tmp/test-filebeat/test.log
  prospector.scanner.check_interval: 2s

If you want to read the timestamps in the registry, you can use this application: GitHub - belimawr/registry-decoder: Filebeat registry decoder/pretty-printer.

Thanks for this tool!
I've just used it, and I'm sorry to bother you again, but I see that some timestamps seem to be incorrect, like the updated field in the third entry?

$ ./registry-decoder log.json | grep test.log

{"k":"filestream::test.log::native::1305-64790","v":{"cursor":{"offset":1604},"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":1800000000000,"updated":"2023-06-23T18:10:25.568418028+02:00"}}
{"k":"filestream::test.log::native::1305-64790","v":{"cursor":{"offset":2406},"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":1800000000000,"updated":"2023-06-23T18:20:25.568849086+02:00"}}
{"k":"filestream::test.log::native::1305-64790","v":{"cursor":{"offset":2406},"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":0,"updated":"2023-06-23T18:20:25.568849086+02:00"}}
{"k":"filestream::test.log::native::1302-64790","v":{"cursor":null,"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":0,"updated":"0000-12-31T23:55:44Z"}}
{"k":"filestream::test.log::native::1302-64790","v":{"cursor":null,"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":1800000000000,"updated":"2023-06-23T18:20:27.568606412+02:00"}}
...

The timestamps in the registry's raw data seem to be corrupt?

cat log.json | grep test.log
{"k":"filestream::test.log::native::1305-64790","v":{"ttl":1800000000000,"updated":[515964493548,1687536625],"cursor":{"offset":1604},"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"}}}
{"k":"filestream::test.log::native::1305-64790","v":{"ttl":1800000000000,"updated":[515964924606,1687537225],"cursor":{"offset":2406},"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"}}}
{"k":"filestream::test.log::native::1305-64790","v":{"ttl":0,"updated":[515964924606,1687537225],"cursor":{"offset":2406},"meta":{"identifier_name":"native","source":"/tmp/test-filebeat/test.log"}}}
{"k":"filestream::test.log::native::1302-64790","v":{"ttl":0,"updated":[281470681743360,18446744011573954816],"cursor":null,"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"}}}
{"k":"filestream::test.log::native::1302-64790","v":{"meta":{"source":"/tmp/test-filebeat/test.log","identifier_name":"native"},"ttl":1800000000000,"updated":[515964681932,1687537227],"cursor":null}}
...
