Restarting logstash container sends events again to elastic, despite sincedb

paul_chrlt · February 3, 2023, 4:35pm

Hi all,

Can you help us ?

We use elk 7.17.7 in docker containers hosted on a server with persistent shared volumes for path (read only), file_completed_log_path, sincedb_path.

Each time we stop and start our logstash container, all our files are read again and events sent again to elastic.

Here is our sincedb (there is 4 files in /tmp/access_logs/):

cat /var/tmp/sincedb 
271957 0 2305 3957761 1675437232.355527 /tmp/access_logs/updates.access.log.2022-01-05.gz
5769685 0 2305 424121 1675437236.113389
5769687 0 2305 356064 1675437237.877867
6555382 0 2305 4623717 1675437248.771921

At each restart, we see the timestamp updated. inode, devices and bytes read don't change.

Here is our config:

input {
 stdin{}
    file {
         path => "/tmp/access_logs/*.gz"
         mode => "read"
         file_completed_action => "log"
         file_completed_log_path => "/var/tmp/imported_access_log_archives.log"
         sincedb_path => "/var/tmp/sincedb"
         start_position => "beginning"
    }
}

filter {
<<<removed private content>>>
}

output {
	elasticsearch {
		hosts => "elasticsearch:9200"
		user => "elastic"
		password => "<<<our-password>>>"
		index => "logstash-nginx"
	}
}

Thanks for your help!

Rios · February 3, 2023, 9:29pm

You should use: mode => "tail"
Most likely /var/tmp is clean on restart. Avoid Linux reserved directories, use your own with logstash:logstash permissions or leave sincedb_path emty to use default path.

paul_chrlt · February 6, 2023, 2:41pm

Thank you Rios!
Just tested mode => "tail", unsuccessfully. Probably because we use gzip files, which seems to be only supported by mode => "read".
When we stop and start the logstash container, since_db is persisted. Imported access log too.

Rios · February 6, 2023, 4:42pm

Then you can implement unique id values with fingerprint. Check this

paul_chrlt · February 6, 2023, 5:35pm

Thank you for this link Rios. Am I wrong if I consider this as a workaround, as the files are still read multiple times? As there already is data on Elasticsearch, I would like to avoid recreating indices/fingerprint if that is possible.

After inspection of logs at trace level there is this interesting message : associate: matched but allocated to another, which seems to indicate that logstash matches the inode but not the file path and so reads the file again. Here is the code generating this log on github.

Here are the log entries below, filtered to keep the lines related to only one file (behaviour is same for all 4 of them).

[2023-02-06T15:55:57,189][TRACE][filewatch.discoverer     ][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] discover_files {:count=>4}
[2023-02-06T15:55:57,238][TRACE][filewatch.discoverer     ][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] handling: {:new_discovery=>true, :watched_file=>"<FileWatch::WatchedFile: @filename='updates.access.log.2023-01-30.gz', @state=:watched, @recent_states=[:watched], @bytes_read=0, @bytes_unread=0, current_size=424121, last_stat_size=424121, file_open?=false, @initial=true, sincedb_key='5769685 0 2305'>"}
[2023-02-06T15:55:57,239][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: finding {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz", :inode=>"5769685"}
[2023-02-06T15:55:57,239][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: found sincedb record {:filename=>"updates.access.log.2023-01-30.gz", :sincedb_key=>#<struct FileWatch::InodeStruct inode="5769685", maj=0, min=2305>, :sincedb_value=>#<FileWatch::SincedbValue:0x5fdca3ba @last_changed_at=1675698022.247521, @path_in_sincedb="", @watched_file=nil, @position=424121>}
[2023-02-06T15:55:57,240][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: matched but allocated to another {:filename=>"updates.access.log.2023-01-30.gz"}

[2023-02-06T15:56:22,514][TRACE][filewatch.readmode.handlers.readzipfile][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] handling: {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz"}
[2023-02-06T15:56:22,515][TRACE][filewatch.readmode.handlers.readzipfile][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] add_new_value_sincedb_collection: {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz", :position=>0}

[2023-02-06T15:56:45,907][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] sincedb_write: /var/tmp/sincedb (time = 2023-02-06 15:56:45 +0000)
[2023-02-06T15:56:45,907][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] non_atomic_write:  {:time=>2023-02-06 15:56:45 +0000}

Badger · February 6, 2023, 5:50pm

That appears to be hitting something similar to this use case. Making the distinction between rotation and inode reuse without doing a checksum of the file each time it is read is probably impossible. The file input uses heuristics that almost always get it right for some use cases (like daily log files on a busy filesystem). This is not one of them.

paul_chrlt · February 14, 2023, 4:10pm

Thank you Badger. In the usecase you provided, the filename is changed.
What bothered me is that in our case, neither the filename or the inode changed.

We decided to add a fingerprint filter and to create the document_id on logstash side. That solves our issue, even if it looks like a workaround.

Filter section:

filter {
    fingerprint {
        source => ["a_timestamp", "some_other_field" ]
        target => "[@metadata][generated_id]"
        method => "SHA256"
        key => "some_random_key"
        concatenate_sources => true
    }
}

Output section:

output {
    elasticsearch {
        document_id => "%{[@metadata][generated_id]}"
    }
}

Badger · February 14, 2023, 6:03pm

Note that a key is not required for SHA256. The documentation may suggest it is, but the code knows to create a digest if it does not have a key that allows it to create an HMAC.

system · March 14, 2023, 6:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Logstash on docker: configure sincedb file to use the last indexed position whenever container restarts/new Logstash docker	12	1862	August 14, 2020
Logstash file read , mode read, sincedb, duplicate events after restart logstash Logstash	3	857	December 29, 2020
Logstash sincedb file issue Logstash	1	488	September 17, 2020
Logstash read the entire file again whenever pipeline reload Logstash	13	1380	April 10, 2020
After restart, logstash doesn't read only new files Logstash	7	972	January 13, 2022

Restarting logstash container sends events again to elastic, despite sincedb

Related topics