I have Logstash 7.6 running on Ubuntu 20.04. I am importing a pretty simple CSV file. The data in the file is separated by commas. See below:
email,send_time,engaged,last_touch_utc,score,frequency,open_count,click_count,codes
VlVSUlNESndNa1F4YmtKYWRWZGFVZz09/2wlyJIaCuyb/DLM16QSCRXIsoig6YQpWjR479tY8vM=,yahoo.com,86,1,2020-08-26 07:51:19 PM,50,0.5,3,6,448140|446120|454113
VVhnMWFVaFhNbWRRU21oa09HazRaQT098M+44dupuzateOzZgtW1siqZcCn97z5P30cDiljVA54=,gmail.com,86,1,2020-10-04 04:36:33 PM,41,5,112,19,
The import file has about 1 million records.
The config file is pretty simple:
input {
  tcp {
    port => 5000
  }
  file {
    path => '/var/ao/FileDrop/ecsv/*.csv'
    mode => 'read'
    start_position => 'beginning'
    file_completed_action => 'delete'
  }
}
filter {
  csv {
    columns => ['email','domain','send_time','engaged','goobrt','score','frequency','open_count','click_count','codes']
    separator => ','
    skip_header => "true"
  }
  # if( [goobrt] != "n/a") {
  date {
    match => ["goobrt", "yyyy-M-d H:mm:ss a"]
    timezone => "UTC"
    target => "last_touch_utc"
  }
  # }
  mutate {
    gsub => [ "path", "^.*\/", "" ]
    gsub => [ "path", ".out.csv", "" ]
  }
  mutate {
    copy => { "path" => "[@metadata][indexName]" }
  }
  ruby {
    code => "
      wanted_fields = ['path','email','send_time','engaged','last_touch_utc','score','domain','frequency','open_count','click_count','codes']
      event.to_hash.keys.each { |k|
        event.remove(k) unless wanted_fields.include? k
      }
    "
  }
}
output {
  elasticsearch {
    hosts => 'localhost:9200'
    # index => "ecsv"
    index => "ecsv-%{[@metadata][indexName]}"
  }
}
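For what it's worth, one thing I could do to compare the number of events actually leaving the pipeline against the source line count is bolt a second output onto the same pipeline. This is only a sketch and untested; the /tmp path is a placeholder:

output {
  # Sketch: in addition to the existing elasticsearch output, write one line
  # per event to a flat file so the count of events leaving the pipeline can
  # be compared against the line count of the source CSV.
  file {
    path => "/tmp/ecsv-audit.log"
    codec => line { format => "%{email}" }
  }
}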
I have run the import in debug mode, which didn't yield a lot of information. The log had a lot of these lines:
[DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Running csv filter {:event=>#<LogStash::Event:0x6bac85b6>}
[DEBUG] 2020-10-10 18:48:01.639 [[main]>worker1] csv - Event after csv filter {:event=>#<LogStash::Event:0x6bac85b6>}
After looking at the Ruby source of the csv filter, those lines are not concerning at all. I also had lines like this in the log:
[DEBUG] 2020-10-10 18:48:01.510 [[main]<file] file - Received line {:path=>"/var/ao/FileDrop/ecsv/3f0faf5f-1c1a-0a73-4751-2b266073e136.out.csv", :text=>"TlZWbU1VcDRUREk1YWpWcmRrVlBhUT0981FXHh6MObMq+4B85WRP0j+Wze04F+gAjMa+rZ0wCr0=,outlook.com,86,1,2020-06-23 01:57:01 AM,45,0.7,219,12,443142|446120|511210|448120\r"}
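One detail in that line: the text ends with a literal \r, so the source file appears to have CRLF line endings. I don't know whether that matters here, but if I wanted to strip it before the csv filter runs, a sketch would be:

filter {
  # Sketch: drop the trailing carriage return seen in the "Received line"
  # debug output before the csv filter, in case it ends up glued onto the
  # last column of each row.
  mutate {
    gsub => [ "message", "\r$", "" ]
  }
}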
Nothing in the log indicated that records were being dropped, and there were no errors. Yet each run imports a different number of records from the same file:
Import 1: 440,668
Import 2: 211,840
Import 3: 417,153
Import 4: 740,142
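To take Elasticsearch out of the picture and count what the pipeline itself emits, I could also try the metrics filter. Again, just a sketch; the existing elasticsearch output would need an `if "metric" not in [tags]` guard so the metric events themselves don't get indexed:

filter {
  # Sketch: placed after the existing filters; meters every event so a
  # running count is printed periodically without involving Elasticsearch.
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  if "metric" in [tags] {
    stdout {
      codec => line { format => "events so far: %{[events][count]}" }
    }
  }
}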
I have tried different files and arrived at the same outcome: messages are being dropped. This is a single-node instance with 8 cores. I have tried varying from 1 to 8 pipeline workers and have adjusted the batch size, but messages are still being dropped.
- The input files are generated in our analytics cluster, so the format is controlled.
- The disk is 50% full.
- There are roughly 500 indices on the server.
I AM STUMPED and about 24 hours into debugging this thing. I have tried everything I know to check.
Thought: the last field of the CSV is called "codes"; sometimes it has values and sometimes it does not. There does not seem to be a correlation with which records are kept or dropped.
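To test that theory, I suppose I could tag the rows with an empty "codes" column and compare how many of each population end up in the index. A sketch, placed after the existing ruby filter so the tag isn't stripped by the field whitelist:

filter {
  # Sketch: tag rows whose trailing "codes" column is missing or empty so the
  # two populations can be counted separately in Elasticsearch.
  if ![codes] or [codes] == "" {
    mutate { add_tag => ["no_codes"] }
  }
}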
Help.