How to prevent Logstash file input duplicate reading with rotating logs?

I have a system that keeps writing logs (with log4j), and it rotates and compresses the log to a .gz file when the file reaches 100 MB.

I use the Logstash file input plugin to read those log files, and there are already several compressed logs there.

The input I use is like this:

input {
  file {
    path => ["/usr/share/logstash/proj_logs/karaf.log"]
    start_position => "beginning"
    codec => multiline {
      pattern => "^\D"
      what => "previous"
    }
  }
  file {
    path => ["/usr/share/logstash/proj_logs/karaf*.log.gz"]
    start_position => "beginning"
    mode => "read"
    codec => multiline {
      pattern => "^\D"
      what => "previous"
    }
  }
}

However, logs read while tailing the file are read again after the file is rotated.
And if I only tail the log file, some logs may be missed because of the rotation.

I wonder if there's a proper way to read these log files without duplicate reading.

Also, if a multi-line event, like a Java stack trace, occurs during rotation, the event will be split (for example, lines 1~3 end up in the rotated file and lines 4~6 in the log file), which causes a grok parse failure and a wrong timestamp. Is there any way to deal with this problem?

Thanks.

Currently, I can think of 2 ways to deal with this problem.

  1. Every event read by the file input plugin has a path field, so I could use a cron job to run a script every day that calls the Elasticsearch API and deletes the documents read by tailing (events produced by tailing are read again from the rotated file).

  2. Disable file rotation, but then the log file will grow larger and larger.

I found that rotating files by renaming alone does not cause duplicate reading (tested with a simple program that renames the log file every time it writes 10 lines). Maybe log4j can be configured not to compress the rotated file? If that's possible, I could use the read mode's file_completed_action to delete files after they are read.
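For what it's worth, in Log4j 2 the compression format is inferred from the filePattern extension, so leaving off .gz makes rollover a plain rename. A minimal sketch, assuming Log4j 2 XML configuration (file paths, layout, and the max count are placeholders, not values from this thread):

```xml
<!-- Sketch: rollover by renaming only, no compression.
     Log4j 2 compresses only when filePattern ends in .gz/.zip,
     so a plain .log extension keeps rotated files uncompressed. -->
<RollingFile name="karaf"
             fileName="/usr/share/logstash/proj_logs/karaf.log"
             filePattern="/usr/share/logstash/proj_logs/karaf-%i.log">
  <PatternLayout pattern="%d{ISO8601} %-5p %c - %m%n"/>
  <Policies>
    <SizeBasedTriggeringPolicy size="100 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="10"/>
</RollingFile>
```

The uncompressed rotated files could then be picked up by the read-mode input and removed with file_completed_action => "delete".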

This should be handled by sincedb. Maybe reduce the write interval to 1 second; the default is 15 seconds. I wouldn't use an additional read-mode input for the archived log.gz files.

Add a sincedb database file and try whether this helps.

input {
  file {
    path => ["/usr/share/logstash/proj_logs/karaf.log"]
    start_position => "beginning"
    sincedb_path => "/path/file.txt"
    sincedb_write_interval => 1
    codec => multiline {
      pattern => "^\D"
      what => "previous"
    }
  }
}

@Rios Thanks for your help. Adding sincedb_write_interval can certainly reduce the chance of missing logs during rotation.

However, our system can write logs very fast (sometimes less than 1 ms apart). For example, some bugs write lots of stack traces, and multiple inbound/outbound messages are written when multiple users call the REST API.

Assuming the system writes a log line every 1 ms, should I set sincedb_write_interval to less than 0.001? I'm worried that making these settings smaller will cause high CPU or memory usage.

The reason I added the path of the archived files is to prevent a rotation between two reads.
For example, at 00:00.100 the log is read by Logstash, and due to some settings the new lines written to the file would next be discovered at 00:00.200, but the file rotates at 00:00.150. In this case, the logs written between 00:00.100 and 00:00.150 are never read.

However, the compressed rotated log has a different inode number (renaming alone would not change it), so Logstash treats it as a new file, which causes duplicate reading.

I wonder if there's any best practice for this use case.

Thanks.

If you are using log4j then don't go via the filesystem. You could go directly to Logstash over TCP, but if your log volume is very high I would suggest using Kafka as a buffer between your app and Logstash.
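On the Logstash side, the Kafka-buffered path could look like the sketch below, assuming the application publishes log events (e.g. via Log4j 2's Kafka appender) to a topic named app-logs on a local broker; the broker address, topic name, and codec are placeholders:

```
input {
  kafka {
    bootstrap_servers => "localhost:9092"   # placeholder broker address
    topics => ["app-logs"]                  # placeholder topic name
    codec => "json"                         # assumes the appender sends JSON events
  }
}
```

Because Kafka retains messages and the consumer tracks its own offset, file rotation and re-reads on the producer side no longer matter to Logstash.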

@Badger Thanks for the reply. I never thought about changing how log4j ships logs before. It seems like a better way!

What does this mean? Does it mean the log is written very fast?

@Badger I found that the log4j input plugin is deprecated and the page says to use Filebeat instead.

After searching on the internet, I found the reason the log4j input plugin exists: log4j serializes the log event when sending it over TCP, so if I use the plain TCP input plugin the data I receive is unreadable (not raw text).

If there's a different way, can you provide a sample to help me understand?

Thanks.
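Not something confirmed in this thread, but one common way around the serialization problem is to make log4j emit newline-delimited JSON over the socket instead of serialized objects, for example with Log4j 2's Socket appender plus JsonLayout (compact="true" eventEol="true"). Logstash can then parse each line with the json_lines codec; the port below is a placeholder:

```
# Sketch: receive newline-delimited JSON sent by a log4j Socket appender.
input {
  tcp {
    port => 5044          # placeholder; must match the appender's port
    codec => json_lines   # one JSON log event per line
  }
}
```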

Yes, that is what I meant.

Does FB read faster than LS? Is there any difference in speed between them?

Maybe more important are stat_interval and discover_interval. file_chunk_size is also interesting; you have to know your line size.

From the docs on stat_interval: "How often (in seconds) we stat files to see if they have been modified. Increasing this interval will decrease the number of system calls we make, but increase the time to detect new log lines." The default value is "1 second"; maybe reduce it to 300 ms.

These are settings you could try before Kafka, if that is possible, of course.
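Combined in the tail input, those settings could look like this; the values are illustrative, not recommendations:

```
input {
  file {
    path => ["/usr/share/logstash/proj_logs/karaf.log"]
    start_position => "beginning"
    sincedb_path => "/path/file.txt"
    sincedb_write_interval => 1
    stat_interval => "300 ms"   # stat tracked files for new lines more often
    discover_interval => 2      # glob re-evaluation, in multiples of stat_interval
    file_chunk_size => 32768    # bytes read per chunk; tune to your line size
    codec => multiline {
      pattern => "^\D"
      what => "previous"
    }
  }
}
```

Lower intervals mean more system calls, so it is worth watching CPU usage after changing them.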

Thanks for the information. I'll try tuning these settings as well.

FB? Facebook?

FB - Filebeat
LS - Logstash

In my environment, I solved the duplicate message problem by creating a Logstash filter that generates a fingerprint based on the source host and the original message field. Then that fingerprint is used as the document ID when it is sent to Elasticsearch. Since I have everything flowing through Logstash this guarantees I never have duplicate messages, regardless of what logging system submits the log. It also protects against log files being re-read after a system restart or roll-over.
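A sketch of that approach, assuming the source host is in a host field and the original line is in message (field names, hash method, and the Elasticsearch host are assumptions, not details given above):

```
filter {
  fingerprint {
    source => ["host", "message"]
    concatenate_sources => true
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # Using the fingerprint as the document ID makes a re-read line
    # overwrite the same document instead of creating a duplicate.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

The trade-off is that indexing becomes an upsert per event, which costs a little write throughput in exchange for idempotent ingestion.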

Thanks for sharing this method, I think this is a solution to this problem.