Process downloaded log files

Hi,

I am downloading files from S3 on a daily basis. The number of files is quite large, approximately 250,000.
I want Logstash to process these files and send the events to Elasticsearch before the next day's download arrives.

I have some confusion related to the file input plugin.

  1. Should I use the "read" or "tail" mode?
  2. What is the significance of "start_position" => "beginning"? I used it and it serves the purpose. Just need to check if I am doing the right thing.
  3. How do I verify if Logstash has read and processed all the files in a particular day?

My LogStash.conf input looks something like this:

input {
  file {
    path => "/local/kdmlogs/*/*/*"
    max_open_files => 99999
    exclude => "*.s3logs"
    mode => "read"
    start_position => "beginning"
    ignore_older => 86400
    close_older => 86400
    sincedb_path => "/usr/yogehada/var/.sincedb"
    sincedb_clean_after => 2.5
    tags => ["logs"]
  }
}

  1. Tail mode is useful for tailing log files to which a process is actively writing data. If you have files that have already been written out in full and you just want to read each entire file, use read mode.

  2. In tail mode, when Logstash first sees a file, by default it seeks to the end and just listens for new data; it ignores any data already in the file. start_position => beginning tells it to read the file from the beginning. This only applies the first time Logstash sees the file. In read mode it is ignored (as is close_older).

  3. In read mode, when Logstash has finished reading a file, by default it deletes it. Once all the files have been deleted, they have all been read. Alternatively, if you tell it to log the file names it has read, you can check that list for completeness (sketched below).
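
For reference, a minimal read-mode input along those lines might look like the sketch below. The paths here are placeholders, not taken from your setup.

input {
  file {
    path => "/some/download/dir/**/*"                            # files that are already fully written
    mode => "read"                                               # read each file once, from start to finish
    # start_position and close_older are ignored in read mode
    file_completed_action => "log"                               # keep the files and log their names instead of deleting them
    file_completed_log_path => "/some/dir/files_processed.log"
    sincedb_path => "/some/dir/.sincedb"                         # remembers which files have already been read
  }
}

The default file_completed_action is "delete", so an empty directory means everything has been read; with "log" (or "log_and_delete") each finished path is appended to file_completed_log_path, which you can compare against the day's download list.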

Thanks for the reply, @Badger.
So now I am convinced that I should be using the "read" mode.

I killed the Logstash process and tried to start it again, but now it doesn't read any files.
I tried removing ignore_older, but that didn't help.
This is what I see in the debug logs:

[2019-05-14T19:38:25,763][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}
[2019-05-14T19:38:26,218][DEBUG][logstash.outputs.file ] Starting stale files cleanup cycle {:files=>{}}
[2019-05-14T19:38:26,218][DEBUG][logstash.outputs.file ] 0 stale files found {:inactive_files=>{}}
[2019-05-14T19:38:26,646][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:27,370][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-05-14T19:38:27,635][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[2019-05-14T19:38:28,134][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-05-14T19:38:28,134][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-05-14T19:38:28,646][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:28,762][DEBUG][logstash.config.source.local.configpathloader] Skipping the following files while reading config since they don't match the specified glob pattern {:files=>["/usr/share/logstash/CONTRIBUTORS", "/usr/share/logstash/Gemfile", "/usr/share/logstash/Gemfile.lock", "/usr/share/logstash/LICENSE.txt", "/usr/share/logstash/NOTICE.TXT", "/usr/share/logstash/bin", "/usr/share/logstash/config", "/usr/share/logstash/data", "/usr/share/logstash/lib", "/usr/share/logstash/logs", "/usr/share/logstash/logstash-core", "/usr/share/logstash/logstash-core-plugin-api", "/usr/share/logstash/modules", "/usr/share/logstash/tools", "/usr/share/logstash/vendor", "/usr/share/logstash/x-pack"]}
[2019-05-14T19:38:28,762][DEBUG][logstash.config.source.local.configpathloader] Reading config file {:config_file=>"/usr/share/logstash/LogStash.conf"}
[2019-05-14T19:38:28,764][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}
[2019-05-14T19:38:30,647][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:31,762][DEBUG][logstash.config.source.local.configpathloader] Skipping the following files while reading config since they don't match the specified glob pattern {:files=>["/usr/share/logstash/CONTRIBUTORS", "/usr/share/logstash/Gemfile", "/usr/share/logstash/Gemfile.lock", "/usr/share/logstash/LICENSE.txt", "/usr/share/logstash/NOTICE.TXT", "/usr/share/logstash/bin", "/usr/share/logstash/config", "/usr/share/logstash/data", "/usr/share/logstash/lib", "/usr/share/logstash/logs", "/usr/share/logstash/logstash-core", "/usr/share/logstash/logstash-core-plugin-api", "/usr/share/logstash/modules", "/usr/share/logstash/tools", "/usr/share/logstash/vendor", "/usr/share/logstash/x-pack"]}
[2019-05-14T19:38:31,762][DEBUG][logstash.config.source.local.configpathloader] Reading config file {:config_file=>"/usr/share/logstash/LogStash.conf"}
[2019-05-14T19:38:31,764][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}

No events are being processed, and I don't see any errors in the logs either. It looks like some kind of infinite loop. I tried restarting Logstash, but that didn't help.

Also, I deleted the old .sincedb file before starting Logstash.
This is how my LogStash.conf looks now:

input {
  file {
    path => "/local/kdmlogs/*/*/*"
    max_open_files => 50000
    exclude => "*.s3logs"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/local/metric/files_processed"
    start_position => "beginning"
    #ignore_older => 86400
    sincedb_path => "/usr/yogehada/var/.sincedb"
    sincedb_clean_after => 2.5
    tags => ["logs"]
  }
}
filter {
  if "logs" in [tags] {
    grok {
      id => "parse_log_line"
      match => {
        "message" => "%{YEAR:year}%{MONTHNUM:month}%{MONTHDAY:date}:%{HOUR}%{MINUTE}(%{SECOND}) %{SYSLOGPROG}:%{GREEDYDATA:syslog_message}"
      }
    }
    grok {
      id => "parse_filename"
      match => {
        "path" => "/local/kdmlogs/*/*/%{GREEDYDATA}#%{GREEDYDATA:device_type}#%{GREEDYDATA:software_version}#%{GREEDYDATA:dsn}#%{GREEDYDATA:s3key}"
      }
    }
    # Remove fields we don't want (to save space)
    mutate {
      remove_field => ["message", "path", "host"]
    }
  }
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  if "_grokparsefailure" not in [tags] and "logs" in [tags] {
    amazon_es {
      hosts => ["${HOST}"]
      region => "us-east-1"
      aws_access_key_id => "${ACCESSKEY}"
      aws_secret_access_key => "${SECRET}"
      index => "sys_logs%{+YYYY.MM.dd}"
    }
  }
  if "metric" in [tags] {
    file {
      path => "/local/metric/test-metric.txt"
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  }
}

If you have files that match the pattern "/local/kdmlogs/*/*/*" but whose filenames do not match "*.s3logs", then you should see them being read and their paths added to "/local/metric/files_processed".

Is it possible you meant "/local/kdmlogs/**/*" (i.e. a recursive descent)?
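
To illustrate the difference (only the path option is shown; this is just a sketch):

path => "/local/kdmlogs/*/*/*"   # matches files exactly two directory levels below /local/kdmlogs
path => "/local/kdmlogs/**/*"    # matches files at any depth below /local/kdmlogs (recursive descent)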

If you still do not get anything, set log.level to trace (through either the command line, logstash.yml, or, specifically for filewatch, as described in this post).

Hi @Badger ,

My actual log files are present under "/local/kdmlogs/*/*/".
There are two levels of subdirectories under /local/kdmlogs. It was working perfectly fine before this. Let me enable trace-level logging.

After enabling trace-level logging, this is what I see in the logs:

[2019-05-14T21:02:46,634][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341672359#726#Kindle 5.11.1 (3527990035)#9804d1903e4064994ed60372c2ef4e21#4a25d%2F1557657288%2F9804d1903e4064994ed60372c2ef4e21.1557657269993:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='333254', last_stat_size='333254', file_open?='false', @initial=true, @sincedb_key='24775306 0 64768'>"}
[2019-05-14T21:02:46,634][TRACE][filewatch.sincedbcollection] associate: finding {"inode"=>"24775306", "path"=>"/local/kdmlogs/2019051319/91/9223351846341672359#726#Kindle 5.11.1 (3527990035)#9804d1903e4064994ed60372c2ef4e21#4a25d%2F1557657288%2F9804d1903e4064994ed60372c2ef4e21.1557657269993:DET:UPLOAD"}
[2019-05-14T21:02:46,634][TRACE][filewatch.sincedbcollection] associate: unmatched
[2019-05-14T21:02:46,635][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341684579#726#Kindle 5.11.1 (3527990035)#add089b1d9d0d50c67a262050c74cba0#18f20%2F1557652348%2Fadd089b1d9d0d50c67a262050c74cba0.1557652324692:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='710', last_stat_size='710', file_open?='false', @initial=true, @sincedb_key='24775349 0 64768'>"}
[2019-05-14T21:02:46,635][TRACE][filewatch.sincedbcollection] associate: finding {"inode"=>"24775349", "path"=>"/local/kdmlogs/2019051319/91/9223351846341684579#726#Kindle 5.11.1 (3527990035)#add089b1d9d0d50c67a262050c74cba0#18f20%2F1557652348%2Fadd089b1d9d0d50c67a262050c74cba0.1557652324692:DET:UPLOAD"}
[2019-05-14T21:02:46,635][TRACE][filewatch.sincedbcollection] associate: unmatched
[2019-05-14T21:02:46,636][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341704272#726#Kindle 5.11.1 (3527990035)#853ea75b6a45313932cb2a43c59f209e#49e2f%2F1557645334%2F853ea75b6a45313932cb2a43c59f209e.1557645308668:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='227427', last_stat_size='227427', file_open?='false', @initial=true, @sincedb_key='24776898 0 64768'>"}

I am not sure what this actually means.
I presume it discovers the file, looks it up in the sincedb collection, doesn't find it, and marks it as unmatched.

Hi @Badger, did you find time to look at the log above?

Thanks.

I did, but I do not understand the filewatch code well enough to interpret it. It is finding the files and knows they are not zero length, but it does not appear to start reading them. I do not understand how that could happen.

An update on this: after around 15 hours, Logstash started reading some files.
However, the event rate I captured from the "metrics" plugin was too low, around 2,000 events/sec; I usually get an average of 25,000 events/sec.

It stopped again after an hour. It didn't read all the files from the subdirectory.

Unfortunately, there are not enough components logging at TRACE level.

I need as many filewatch.readmode.* components as possible logging at TRACE.

What is the output of curl -XGET 'localhost:9600/_node/logging?pretty'?
