File input v4.1.2 tailing - duplicate docs ingested

Hi Team,

I'm seeing a large number of duplicates for each individual doc sent by Logstash. I am not yet sure whether this is specific to the 4.1.x plugin; I will run a test with 4.0.3 and confirm later.

Below is my input config:

  file {
    sincedb_path          => "/foo/bar/.sincedb-foobar"
    max_open_files        => 10000
    close_older           => 60
    ignore_older          => 1296000 # 15 days
    path                  => "/foo/bar/*.log"
    type                  => "grid_node"
    start_position        => "beginning"

    codec => multiline {
      patterns_dir        => ["${FOOBAR_PATTERNS}"]
      pattern             => "%{FOOBAR}"
      negate              => "true"
      what                => "previous"
      auto_flush_interval => 150
      max_lines           => 5000
    }
  }

I suspect the duplication has something to do with the log file being rolled, because other types of logs that do not get rolled don't show this issue (or maybe not yet).

As shown in the config, Logstash watches the *.log files in the /foo/bar folder. These *.log files get rolled over after some time, ending in .log.1, then .log.2, and so on up to .log.x, which Logstash is not configured to watch.

The duplicate docs all have the path pointing to the original log file ending in .log, but when I grep for the particular log line I find it in a rolled-over file ending in .log.x.
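
To illustrate which files the path setting should pick up, here is a small Python sketch. It is only an approximation: Logstash resolves the glob in Ruby, not with Python's fnmatch, and the file names below are made up for the example. It just shows that a pattern ending in *.log is not expected to match the rolled-over names.

```python
import fnmatch

# Same glob as the `path` setting in the config above.
PATTERN = "/foo/bar/*.log"

# Hypothetical file names, standing in for one live log and its rolled copies.
CANDIDATES = [
    "/foo/bar/node.log",
    "/foo/bar/node.log.1",
    "/foo/bar/node.log.2",
    "/foo/bar/node.log.3",
]


def split_by_pattern(paths, pattern):
    """Return (matched, unmatched) lists for a shell-style glob pattern."""
    matched = [p for p in paths if fnmatch.fnmatch(p, pattern)]
    unmatched = [p for p in paths if not fnmatch.fnmatch(p, pattern)]
    return matched, unmatched


if __name__ == "__main__":
    watched, ignored = split_by_pattern(CANDIDATES, PATTERN)
    print("watched:", watched)
    print("ignored:", ignored)
```

So in principle only the live .log file is watched, which matches the path value seen on the duplicate docs.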

What I don't get is why there are so many duplicates, probably hundreds, and the number keeps growing because Logstash keeps sending these dupes non-stop, even though the rolled files only go up to .log.3 at most.

Cheers,

So I did a test with the same Logstash version and everything else the same, but this time with file input v4.0.5, and there are no dupes.

I need to understand the exact sequence of actions here. I think there is a bug in the latest code to do with the way we react when a file is rolled.

Please explain the file roll/rotation in as much detail as possible.

Sorry for the late reply.

Our file rotation is done using this python logging handler: https://docs.python.org/2/library/logging.handlers.html#rotatingfilehandler

As far as I know, we don't do anything extravagant: we just use the above handler with a max size set, and the file gets rolled when it reaches that size. The docs above explain pretty well how the handler's rolling mechanism works.
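
In case it helps to reproduce the rotation outside our application, below is a minimal sketch of the rename scheme using the same handler. The file name, size limit, and line contents are placeholders and not our real settings; backupCount=3 is assumed only because our files go up to .log.3. It records the inode of the live file before the first rollover to show that the old content (and its inode) moves to .log.1 while a fresh file takes over the .log name.

```python
import logging
import logging.handlers
import os
import tempfile


def demo_rotation(max_bytes=100, backup_count=3, extra_lines=40):
    """Force several rollovers in a temp dir and report what ends up on disk."""
    log_dir = tempfile.mkdtemp()
    log_path = os.path.join(log_dir, "node.log")  # hypothetical file name

    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    logger = logging.getLogger("rotation_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)

    try:
        # Write one line, then remember which inode the live file has.
        logger.info("line %03d of some log output", 0)
        first_inode = os.stat(log_path).st_ino

        # Keep writing until the first rollover has just happened.
        i = 1
        while not os.path.exists(log_path + ".1"):
            logger.info("line %03d of some log output", i)
            i += 1

        # The old file was renamed to .1 and a new file was opened as .log.
        backup_inode = os.stat(log_path + ".1").st_ino
        live_inode = os.stat(log_path).st_ino

        # Write more lines so the backups fill up to backup_count.
        for j in range(extra_lines):
            logger.info("line %03d of some log output", i + j)
    finally:
        logger.removeHandler(handler)
        handler.close()

    return {
        "files": sorted(os.listdir(log_dir)),
        "old_inode_moved_to_backup": backup_inode == first_inode,
        "new_file_has_new_inode": live_inode != first_inode,
    }


if __name__ == "__main__":
    for key, value in demo_rotation().items():
        print(key, "=", value)
```

On a POSIX filesystem the rename keeps the inode, so anything that tracks files by inode sees the previously read content under a name it is not watching, and a new inode under the watched name. Whether that is what trips up the plugin I can't say; it is only what the handler does on our side.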

Let me know if you need any further info - will do my best to provide it.

@ld_pvl

I want to work on this issue and the others you have open in one place.

  1. This issue and the read-to-eof-no-delimiter-found-in-current-chunk issue have the same origin, in my opinion. It is a bug (I think) related to how the plugin reacts to file rotations in the scheme used.

  2. The last outstanding issue you have concerns the very long time it takes to move from the discover phase to the processing phase when initially discovering 500,000 files.

I will look at both of these shortly. Issue 1 can affect other users, while issue 2 is more specific to your use case.
Both are extremely hard to write failing test scenarios for, though, so patience is needed.

