Restarting logstash container sends events again to elastic, despite sincedb

Hi all,

Can you help us?

We use the ELK stack 7.17.7 in Docker containers hosted on a server, with persistent shared volumes for path (read-only), file_completed_log_path, and sincedb_path.

Each time we stop and start our logstash container, all our files are read again and events sent again to elastic.

Here is our sincedb (there are 4 files in /tmp/access_logs/):

cat /var/tmp/sincedb 
271957 0 2305 3957761 1675437232.355527 /tmp/access_logs/updates.access.log.2022-01-05.gz
5769685 0 2305 424121 1675437236.113389
5769687 0 2305 356064 1675437237.877867
6555382 0 2305 4623717 1675437248.771921

At each restart we see the timestamp updated; the inode, device numbers, and bytes read don't change.
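For anyone reading along, the sincedb columns map onto the first line above like this (per the file input plugin documentation; the last column, the path, may be empty for records written in read mode):

```
inode   major_dev  minor_dev  bytes_read  last_activity_ts    path (optional)
271957  0          2305       3957761     1675437232.355527   /tmp/access_logs/updates.access.log.2022-01-05.gz
```

So the unchanged bytes_read values confirm that Logstash does remember how far it got; the question is why it re-reads anyway.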

Here is our config:

input {
    file {
        path => "/tmp/access_logs/*.gz"
        mode => "read"
        file_completed_action => "log"
        file_completed_log_path => "/var/tmp/imported_access_log_archives.log"
        sincedb_path => "/var/tmp/sincedb"
        start_position => "beginning"
    }
}

filter {
    <<<removed private content>>>
}

output {
	elasticsearch {
		hosts => "elasticsearch:9200"
		user => "elastic"
		password => "<<<our-password>>>"
		index => "logstash-nginx"
	}
}

Thanks for your help!

You should use: mode => "tail"
Most likely /var/tmp is cleaned on restart. Avoid system-managed directories like /var/tmp; use your own directory with logstash:logstash permissions, or leave sincedb_path empty to use the default path.

Thank you Rios!
Just tested mode => "tail", without success, probably because we use gzip files, which seem to be supported only by mode => "read".
When we stop and start the Logstash container, the sincedb is persisted, and so is the imported access log.

Then you can implement unique ID values with the fingerprint filter. Check this


Thank you for this link Rios. Am I wrong to consider this a workaround, since the files are still read multiple times? As there is already data in Elasticsearch, I would like to avoid recreating indices/fingerprints if possible.

After inspecting the logs at trace level, there is this interesting message: associate: matched but allocated to another, which seems to indicate that Logstash matches the inode but not the file path, and so reads the file again. Here is the code generating this log on GitHub.

Here are the log entries below, filtered to keep the lines related to only one file (the behaviour is the same for all 4 of them).

[2023-02-06T15:55:57,189][TRACE][filewatch.discoverer     ][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] discover_files {:count=>4}
[2023-02-06T15:55:57,238][TRACE][filewatch.discoverer     ][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] handling: {:new_discovery=>true, :watched_file=>"<FileWatch::WatchedFile: @filename='updates.access.log.2023-01-30.gz', @state=:watched, @recent_states=[:watched], @bytes_read=0, @bytes_unread=0, current_size=424121, last_stat_size=424121, file_open?=false, @initial=true, sincedb_key='5769685 0 2305'>"}
[2023-02-06T15:55:57,239][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: finding {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz", :inode=>"5769685"}
[2023-02-06T15:55:57,239][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: found sincedb record {:filename=>"updates.access.log.2023-01-30.gz", :sincedb_key=>#<struct FileWatch::InodeStruct inode="5769685", maj=0, min=2305>, :sincedb_value=>#<FileWatch::SincedbValue:0x5fdca3ba @last_changed_at=1675698022.247521, @path_in_sincedb="", @watched_file=nil, @position=424121>}
[2023-02-06T15:55:57,240][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] associate: matched but allocated to another {:filename=>"updates.access.log.2023-01-30.gz"}

[2023-02-06T15:56:22,514][TRACE][filewatch.readmode.handlers.readzipfile][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] handling: {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz"}
[2023-02-06T15:56:22,515][TRACE][filewatch.readmode.handlers.readzipfile][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] add_new_value_sincedb_collection: {:path=>"/tmp/access_logs/updates.access.log.2023-01-30.gz", :position=>0}

[2023-02-06T15:56:45,907][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] sincedb_write: /var/tmp/sincedb (time = 2023-02-06 15:56:45 +0000)
[2023-02-06T15:56:45,907][TRACE][filewatch.sincedbcollection][main][c2afb52b5a6598556fbbd090c79da9d9bb75d8cdf3f0d9ce66d4f16c163e157d] non_atomic_write:  {:time=>2023-02-06 15:56:45 +0000}

That appears to be hitting something similar to this use case. Making the distinction between rotation and inode reuse without doing a checksum of the file each time it is read is probably impossible. The file input uses heuristics that almost always get it right for some use cases (like daily log files on a busy filesystem). This is not one of them.


Thank you Badger. In the use case you provided, the filename changed.
What bothered me is that in our case, neither the filename nor the inode changed.

We decided to add a fingerprint filter and create the document_id on the Logstash side. That solves our issue, even if it looks like a workaround.

Filter section:

filter {
    fingerprint {
        source => ["a_timestamp", "some_other_field"]
        target => "[@metadata][generated_id]"
        method => "SHA256"
        key => "some_random_key"
        concatenate_sources => true
    }
}

Output section:

output {
    elasticsearch {
        document_id => "%{[@metadata][generated_id]}"
    }
}

Note that a key is not required for SHA256. The documentation may suggest it is, but the code falls back to computing a plain digest when no key is provided; a key is only needed when you want an HMAC.
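To illustrate, here is a sketch of the same filter without a key; in that case the fingerprint plugin produces a plain SHA-256 digest instead of an HMAC (the field names a_timestamp and some_other_field are the placeholders from the config above):

```
filter {
    fingerprint {
        source => ["a_timestamp", "some_other_field"]
        target => "[@metadata][generated_id]"
        method => "SHA256"              # no key => plain SHA-256 digest, not HMAC-SHA256
        concatenate_sources => true     # hash one concatenated string, not each field separately
    }
}
```

Either variant works for deduplication; the important property is that the same input event always produces the same document_id, so re-reads overwrite the existing document instead of creating a duplicate.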

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.