Logstash re-processes files, incorrectly updates sincedb

I've been struggling with this intermittent issue for months: sometimes a log file is read twice into Elasticsearch, and sincedb shows incorrect filenames while this is happening.

Apologies for the long post; I'm trying to include all the relevant information while keeping it concise.

The duplication always starts shortly after a log file is rotated. I've observed the following example happen in real-time:

  • The current log file being written to has inode 15204362, filename "flow.log"

    • sincedb is being updated every 15 seconds as intended
    • The oldest log file in the directory is inode 15204354, filename "flow.log-20240429-1714406426"
    • The oldest entry in sincedb is for a file that no longer exists, with inode 15204363, filename "flow.log-20240429-1714370426"
  • The log is rotated (see the shell sketch after this list):

    • the old active log file is renamed to "flow.log-20240501-1714579438" and keeps inode 15204362
    • the new active log file is created with inode 15204363, filename "flow.log" (note: same inode as the current oldest entry in sincedb)
  • About 7 seconds later, sincedb is updated:

    • it correctly rewrites the line for inode 15204363, resets the byte counter to a low number (indicating that some data from the new file has already been processed), and changes the filename to "flow.log"
    • it rewrites the name of the just-rotated log file from "flow.log" to "flow.log-20240428-1714341626" (!!!), keeping inode number 15204362 and the same byte count it already had.
  • Logstash now stops processing the new active log file and starts re-processing inode 15204362, filename "flow.log-20240501-1714579438", i.e. the file that was just rotated and which Logstash had already finished.

    • sincedb is not updated again until that file is finished processing
    • Once that file has been processed the second time, Logstash continues processing the active log file and catches up to real time.
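
To make the sequence concrete, here is a rough shell sketch of the rotation as I understand it (the commands are my reconstruction, not the actual rotation tool's configuration):

    # rename the active file; the rename keeps inode 15204362
    mv flow.log flow.log-20240501-1714579438

    # create the new active file; the filesystem hands out inode 15204363,
    # which was freed earlier when an old rotated file was deleted
    touch flow.log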

There is clearly something strange here, with sincedb assigning the incorrect filename (flow.log-20240428-1714341626) to the just-rotated log file. I don't know where it pulled that filename from, because it didn't exist anywhere at the moment the log rotation happened. A file by that name does appear to have existed in the past, but it would have been deleted several hours earlier, so Logstash must be retaining the filename somewhere and incorrectly writing it here?

From the docs I know that the filename recorded in sincedb shouldn't make a difference, as it's the inode and device numbers that are tracked. But it seems like a significant coincidence that this naming error occurs at the same time as the duplicate file processing, so it may be relevant.

Despite the file name being wrong in sincedb, Logstash should still not be re-processing the old file, because sincedb still records its inode number with a high byte offset value. Yet, it starts this file again from the beginning. Is there any explanation for this that can be fixed by configuration?
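
For reference, my reading of the sincedb record layout from the file input docs is one line per watched file:

    <inode> <major device number> <minor device number> <byte offset> <last activity timestamp> <path>

so the first three fields are the identity being matched, and the path is only informational.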

Note: looking at sincedb a day later, it has corrected the filename for inode 15204362; I'm not sure at what point that happened.

I have 5 pipelines, and this only occurs on the busiest of them. The well-behaved pipelines each process fewer than 800 events per second, while the pipeline with the duplication issue processes between 2000 and 5500 events per second during normal usage. On the surface, Logstash doesn't seem to be hitting a performance cap at 5500 e/s, because when it re-processes old log files it shoots up to about 20000 e/s and sustains that for over an hour, so under normal operating conditions there seems to be plenty of headroom.

Config info:

Logstash v8.12.2

logstash.yml:

pipeline.workers: 16
pipeline.batch.size: 2500

jvm.options:

-Xms30g
-Xmx30g

Hardware:

128GB RAM
2 x 10 Core CPU @ 2.20GHz
15K RPM SAS disks

The "file" configuration for the problem pipeline is:

file {
    id => "flow-input-file-1"
    path => "/var/log/remote/*/flow/flow.log*"
    start_position => "beginning"
    sincedb_path => "/etc/logstash/sincedb/flow"
    sincedb_write_interval => "15 seconds"
    ignore_older => "60 seconds"
}
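
(One thing I may experiment with, purely as an untested idea: the file input also has a sincedb_clean_after option, defaulting to 2 weeks, which controls how long sincedb entries for inactive files are retained. Something like:

    file {
        # ... same options as above ...
        sincedb_clean_after => "2 days"
    }

might at least reduce how long stale inode entries with old filenames linger in sincedb.)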

I had the following logging enabled on Logstash during the above log rotation:

    "logger.filewatch.discoverer" : "DEBUG",
    "logger.filewatch.observingtail" : "DEBUG",
    "logger.filewatch.sincedbcollection" : "DEBUG",
    "logger.filewatch.tailmode.handlers.create" : "DEBUG",
    "logger.filewatch.tailmode.handlers.delete" : "DEBUG",
    "logger.filewatch.tailmode.handlers.grow" : "DEBUG",
    "logger.filewatch.tailmode.handlers.shrink" : "DEBUG",
    "logger.filewatch.tailmode.handlers.unignore" : "DEBUG",
    "logger.filewatch.tailmode.processor" : "DEBUG",
    "logger.logstash.inputs.file" : "DEBUG",
    "logger.logstash.outputs.elasticsearch" : "DEBUG"

I trawled through the resulting logs and couldn't see anything interesting around that period of time.

Has anyone encountered this type of issue before? I see plenty of people who have had log files processed multiple times, but I've not seen anyone report this specific set of symptoms, and the fixes that worked for others haven't helped me so far.

Is there any other logging that might be useful for me to be looking at?

Thanks if you've read to the end of this!

You state that inodes are being re-used. Detecting that in the file input is a ridiculously hard problem. Any approach that has a reasonable cost (i.e. not doing checksums on chunks of the file) will sometimes get the answer wrong. See this thread for an example.

And yes, it having the wrong name for a file in the sincedb is indeed strange.

Thanks for responding,

Yes, the filesystem does tend to repeat the same 9 inode numbers over and over within this directory, but I don't think this should be a problem if I'm understanding the docs correctly.

For example, after the log rotation described above, the following logs showed Logstash reading from the rotated file:

[2024-05-01T17:04:20,867][DEBUG][logstash.inputs.file     ][flow-1][flow-input-file-1] Received line {:path=>"/var/log/remote/node0/flow/flow.log-20240501-1714579438", :text=>"[log text redacted]"}

The inode of that file is 15204362:

[user@hostname flow]$ ls -lai flow.log-20240501-1714579438
15204362 -rw-r-----. 1 root logs 26342214842 May  1 17:03 flow.log-20240501-1714579438

And the sincedb looked like this:

15204362 0 64770 26339422477 1714579437.905359 /var/log/remote/node0/flow/flow.log-20240428-1714341626
15204363 0 64770 9291907 1714579445.711995 /var/log/remote/node0/flow/flow.log
15204354 0 64770 27752873795 1714558876.8575132 /var/log/remote/node0/flow/flow.log-20240429-1714388426
15204355 0 64770 32669638914 1714558876.858567 /var/log/remote/node0/flow/flow.log-20240429-1714406426
15204356 0 64770 28777272165 1714558876.859586 /var/log/remote/node0/flow/flow.log-20240430-1714435226
15204357 0 64770 27509877190 1714558876.8605418 /var/log/remote/node0/flow/flow.log-20240430-1714471226
15204358 0 64770 27641391806 1714558876.862284 /var/log/remote/node0/flow/flow.log-20240430-1714485626
15204359 0 64770 27445463512 1714558876.863333 /var/log/remote/node0/flow/flow.log-20240430-1714503626
15204360 0 64770 26849010647 1714558876.8643382 /var/log/remote/node0/flow/flow.log-20240501-1714543226
15204361 0 64770 29799082374 1714569758.8309429 /var/log/remote/node0/flow/flow.log-20240501-1714564826

The first line there is the inode in question. Despite the incorrect filename in sincedb, my understanding is that this line should prevent Logstash from re-processing the file from the start, because the inode and device numbers match and the byte offset is very close to the size of the file in question?

Am I thinking about this right?

Thanks

Trying to get my head around the post you linked.

Is Guy saying that in a case like this, Logstash will treat the renamed file as a totally new file (assuming inode recycling) and start it from the beginning? If that did happen, shouldn't the byte offset in sincedb also have been reset?

Guy wasn't speaking to your case. It was just an example of what can go wrong when you have inode re-use.

Your case does seem strange. It could be a bug. When you enable TRACE-level logging, the filewatch library and the file input log details of what they find in the sincedb and how that affects what they do. For the number of files you have, that is going to be a huge amount of logging.
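
If you have the default API settings, you can turn those loggers on at runtime through the logging API, something like this (localhost:9600 assumed):

    curl -XPUT 'localhost:9600/_node/logging?pretty' -H 'Content-Type: application/json' -d '
    {
      "logger.filewatch.sincedbcollection" : "TRACE",
      "logger.filewatch.tailmode.processor" : "TRACE"
    }'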

To be honest I do not understand the logic in the file input here, and Guy is no longer around to clarify it. If you can reproduce the problem using Filebeat to send the files to Logstash, you might get more attention in the Filebeat forum.

Quick question: is this a network share mounted on /var/log/remote, or just a path on the local disk?

Thanks, I'll see if I can get anywhere with the trace logs, and I'll also try replicating the Logstash workflow with Filebeat (sketch below).
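
For the Filebeat side, I'm picturing something minimal along these lines (an untested sketch; the host and port are placeholders):

    filebeat.inputs:
      - type: filestream
        id: flow-files
        paths:
          - /var/log/remote/*/flow/flow.log*

    output.logstash:
      hosts: ["localhost:5044"]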

It does seem unusual to me that, even with Logstash running the whole time, it doesn't work out that a file has just been renamed. It handles the same situation correctly in my other pipelines, and it seems like a very common scenario to have to handle.

Fair question, but no, it's local. Just a directory for logs coming in from "remote" syslog servers.