Logstash with GlusterFS, sincedb issue

Hello!

I am running into an issue with unscheduled reimports of data through Logstash.

I am currently deploying Logstash (version 5.1.1) via Docker. It reads files from the filesystem (GlusterFS, 4 replicas) and imports them into an Elasticsearch cluster (version 5.6.1).

The issue is that when the Logstash container is redeployed, data is sometimes reimported. The cause appears to be a change of the minor device number recorded in the .sincedb file.

There are a couple of things I have checked:

  • the .sincedb file is accessible to Logstash
  • all environment variables used in the configuration are set up properly
  • the .sincedb file is populated with an inode, major/minor device numbers, and the last read byte offset
  • the inode and major/minor device numbers of the files on the filesystem do not change between container redeployments
  • the issue occurs even if the container is redeployed on the same node (I had considered that GlusterFS might present a different inode or major/minor device number per node rather than coherent values across nodes)
  • the issue is inconsistent: sometimes a redeployed container works as intended, other times it starts a reimport

What I also noticed is that the inode in .sincedb does not match the inode I get from the filesystem (ls -i or stat), although every time a reimport occurs the same inodes are used. A subquestion here: does .sincedb store the exact inode, or does it apply some logic to transform the number (e.g. divide by 2)? The inode in .sincedb is 19 digits long (and negative), while on the FS it is 20 digits; see the sketch after the stat output below.

This is an example of the .sincedb from when the first reimport occurred:

-7473865163650054899 0 111 8438672
-5847217082550397153 0 111 9376555
-5713982512048495209 0 111 9119366
-7473865163650054899 0 117 8438672
-5847217082550397153 0 117 9376555
-5713982512048495209 0 117 9119366
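
For reference, my reading of each line is: inode, major device number, minor device number, last read byte offset; only the third column (the minor device number) differs between the two sets of entries, 111 vs. 117. A minimal Ruby sketch to dump the file (the path below is only an example stand-in for ${SINCEDB_PATH}):

# Columns per line (my reading): inode, major device, minor device, bytes read.
# "/glusterfs/logstash/.sincedb_test" is an example path, not my real one.
File.readlines("/glusterfs/logstash/.sincedb_test").each do |line|
  inode, major, minor, pos = line.split
  puts "inode=#{inode} device=#{major}:#{minor} bytes_read=#{pos}"
end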

And this is the stat output for one of the files:

  File: ‘Test_2017-11-11.csv’
  Size: 8106106     Blocks: 15833      IO Block: 131072   regular file
Device: 26h/38d     Inode: 9448843599933928834  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-11-28 09:31:33.993351555 +0100
Modify: 2017-11-11 10:00:30.041572266 +0100
Change: 2017-11-21 14:23:45.197149371 +0100
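
A pattern worth pointing out (this mapping is purely my assumption, not something I have confirmed in the file input's code): the filesystem inode is 20 digits and larger than 2^63, and reinterpreting it as a signed 64-bit integer produces a 19-digit negative number, which is exactly the shape of the values in the .sincedb:

# Assumption: a 64-bit unsigned inode stored as a signed 64-bit value wraps
# around to a negative number (two's complement).
ino = 9448843599933928834              # inode from the stat output above
signed = ino >= 2**63 ? ino - 2**64 : ino
puts signed                            # => -8997900473775622782

That particular value does not appear in the .sincedb above (different files), but it has the same 19-digit negative shape, so maybe the transformation is just a signed/unsigned reinterpretation rather than, say, a division.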

Below is also a sample configuration:

input {
  file {
    path => "${IMPORT_PATH}/TestFile*"
    sincedb_path => "${SINCEDB_PATH}/.sincedb_test"
    start_position => "beginning"
  }
}

filter {

  csv {
    separator => ","
    quote_char => '"'
    columns => ["documentId", "test1", "test2", "StartTimestamp", "input1", "input2", "ip"]
  }

  # Drop the CSV header row.
  if [test1] == "test1" {
    drop {}
  }

  date {
    # e.g. 2015-12-01 07:40:12
    match => ["StartTimestamp", "yyyy-MM-dd HH:mm:ss"]
    target => "@timestamp"
  }

  date {
    # e.g. 2015-12-01 07:40:12
    match => ["StartTimestamp", "yyyy-MM-dd HH:mm:ss"]
    target => "StartTimestamp"
  }

  geoip {
    source => "ip"
    target => "geoip"
    database => "/glusterfs/logstash/config/GeoLite2-City.mmdb"
    add_field => ["[geoip][coordinates]", "%{[geoip][longitude]}"]
    add_field => ["[geoip][coordinates]", "%{[geoip][latitude]}"]
  }

  mutate {
    convert => ["[geoip][coordinates]", "float"]
    convert => ["input2", "integer"]
    add_field => ["Added_field", "${TEST_ENV}"]
  }

  # Replace literal 'NULL' strings with actual null values.
  ruby {
    code => "
      hash = event.to_hash
      hash.each do |k, v|
        if v == 'NULL'
          event.set(k, nil)
        end
      end
    "
  }
}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => ["host1"]
    ssl => true
    cacert => "/glusterfs/logstash/config/certificates/certificate.pem"
    index => "${INDEX_PREFIX}-test-%{+YYYY.MM}"
    document_id => "%{documentId}"
    template => "${MAPPING_PATH}/test_mapping.json"
    template_name => "test_mapping"
    template_overwrite => true
    user => "${ELASTICSEARCH_USERNAME}"
    password => "${ELASTICSEARCH_PASSWORD}"
  }
}

As for the mapping file, it sets the number of shards to 3 and refresh_interval to 5s.

What I noticed, though, is that the document count in the index remains the same after a reimport; the documents seem to be updated in place (a delete + add procedure, hence the growing docs.deleted count):

Before reimport:

health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   zone1_test-2017.11 gcGPsIluTN-H2djXYOjkYA   3   1     190842            0    737.2mb        368.8mb

After reimport:

health status index              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   zone1_test-2017.11 gcGPsIluTN-H2djXYOjkYA   3   1     190842        24507    854.3mb        413.4mb

So basically, I do not understand why Logstash finds different minor device numbers for the files even though on the filesystem they remain the same. At the same time, can the configuration be set up so that the minor device number is not used at all? E.g. in case of a server restart, I presume the number would change, causing a reimport of the data again.
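
For what it's worth, my working mental model (again an assumption based on reading the .sincedb, not on the plugin's source) is that a file's identity is the (inode, major, minor) triple, so when the minor device number flips from 111 to 117 the identity no longer matches anything in the .sincedb and the file is read again from start_position:

# Assumption: the file input identifies a file by (inode, major, minor).
Identity = Struct.new(:inode, :major, :minor)

before = Identity.new(-7473865163650054899, 0, 111)
after  = Identity.new(-7473865163650054899, 0, 117)

puts before == after   # => false, so the file looks unseen and is reimported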

Thank you and best regards,
Bostjan
