File input v4.1.2 tailing - duplicate docs ingested

ld_pvl · May 4, 2018, 2:52pm

Hi Team,

I'm seeing numerous duplicates (a lot of them) for each individual doc being sent by logstash. I am still unsure whether this is specific to the 4.1.x plugin - I will run a test with 4.0.3 and confirm this later.

Below is my input config:

  file {
    sincedb_path          => "/foo/bar/.sincedb-foobar"
    max_open_files        => 10000
    close_older           => 60
    ignore_older          => 1296000 # 15 days
    path                  => "/foo/bar/*.log"
    type                  => "grid_node"
    start_position        => "beginning"

    codec => multiline {
      patterns_dir        => ["${FOOBAR_PATTERNS}"]
      pattern             => "%{FOOBAR}"
      negate              => "true"
      what                => "previous"
      auto_flush_interval => 150
      max_lines           => 5000
    }
  }

I suspect the duplication has something to do with the log file being rolled, because for other types of logs that do not get rolled I don't see this issue (or maybe not yet).

As shown in the config, logstash watches *.log files in the /foo/bar folder. These *.log files get rolled over after some time and are renamed to .log.1, then .log.2, and so on up to .log.x, none of which logstash is configured to watch.

The duplicate docs all have the path pointing to the original log file ending with .log, but when I grep for the particular log line I find it in a rolled-over file ending with .log.x.

What I don't get is why there are so many duplicates, probably hundreds, and the count keeps growing because logstash keeps sending these dupes non-stop, even though the rolled files only go up to .log.3 at most.

Cheers,

ld_pvl · May 4, 2018, 5:04pm

So I did a test with the same logstash version and everything else being the same, but this time with file input v4.0.5, and there are no dupes.

guyboertje · May 4, 2018, 7:55pm

I need to understand the exact sequence of actions here. I think there is a bug in the latest code to do with the way we react when a file is rolled.

Please explain the file roll/rotation in as much detail as possible.

ld_pvl · May 5, 2018, 7:59pm

Sorry for the late reply.

Our file rotation is done using this python logging handler: https://docs.python.org/2/library/logging.handlers.html#rotatingfilehandler

As far as I know, we don't do anything extravagant - we just use the above handler with a max size set, and the file gets rolled when it reaches that size. The docs above explain pretty well how the handler's rolling mechanism works.
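
For anyone not familiar with that handler, here is a minimal Python sketch of the rollover behaviour the linked docs describe. The file name app.log, the 64-byte limit, the backup count of 3 and the helper name demo_rotation are all made up for illustration and are not our real settings. The "*.log" check uses Python's fnmatch, which only approximates how the file input matches its path glob.

```python
import fnmatch
import logging
import logging.handlers
import os
import tempfile


def demo_rotation(n_lines=6, max_bytes=64, backup_count=3):
    """Force several rollovers and report what ends up on disk."""
    with tempfile.TemporaryDirectory() as log_dir:
        log_path = os.path.join(log_dir, "app.log")  # hypothetical file name
        handler = logging.handlers.RotatingFileHandler(
            log_path, maxBytes=max_bytes, backupCount=backup_count
        )
        logger = logging.getLogger("rotation_demo")
        logger.setLevel(logging.INFO)
        logger.propagate = False
        logger.addHandler(handler)
        try:
            # The first line fits under max_bytes, so no rollover happens yet.
            logger.info("line 0 %s", "x" * 40)
            inode_before = os.stat(log_path).st_ino

            # The next line would push the file past max_bytes, so the handler
            # renames app.log to app.log.1 and reopens a fresh app.log.
            logger.info("line 1 %s", "x" * 40)
            inode_of_backup = os.stat(log_path + ".1").st_ino
            inode_of_new_log = os.stat(log_path).st_ino

            # Further lines keep rolling; backups beyond backup_count are dropped.
            for i in range(2, n_lines):
                logger.info("line %d %s", i, "x" * 40)
        finally:
            logger.removeHandler(handler)
            handler.close()

        files = sorted(os.listdir(log_dir))
        return {
            "files": files,
            # Files a "*.log" pattern would pick up (approximation of the glob).
            "watched": fnmatch.filter(files, "*.log"),
            "old_content_moved_to_backup": inode_of_backup == inode_before,
            "log_path_is_new_file": inode_of_new_log != inode_before,
        }


if __name__ == "__main__":
    for key, value in demo_rotation().items():
        print(key, "=", value)
```

The point of the sketch is that the path ending in .log stays the same across a rollover while the file behind it is replaced: the old content moves to .log.1 by rename, and a new, empty .log is created. Only the .log file matches the watched pattern, and the backups never go past the configured count.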

Let me know if you need any further info - will do my best to provide it.

guyboertje · May 9, 2018, 12:21pm

@ld_pvl

I want to work on this issue and the others you have open in one place.

This issue and the read-to-eof-no-delimiter-found-in-current-chunk issue have the same origin, in my opinion. It is a bug (I think) related to how the plugin reacts to file rotations in the scheme used.
The last outstanding issue you have concerns the very long time it takes to move from the discover phase to the processing phase when initially discovering 500,000 files.

I will look at both of these shortly. The first can affect other users, but the second is more applicable to your use case.
Both are extremely hard to write failing test scenarios for, though, so patience is needed.

system · June 6, 2018, 12:21pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
