Process downloaded log files

Hi,

I am downloading files from S3 on a daily basis. The number of files is quite large, approximately 250,000.
I want Logstash to process these files and send the events to Elasticsearch before the next day's download arrives.

I have some confusion related to the file input plugin.

  1. Should I use the "read" or "tail" mode?
  2. What is the significance of "start_position" => "beginning"? I used it and it serves the purpose. Just need to check if I am doing the right thing.
  3. How do I verify if Logstash has read and processed all the files in a particular day?

My LogStash.conf input looks something like this:

input {
  file {
    path => "/local/kdmlogs/*/*/*"
    max_open_files => 99999
    exclude => "*.s3logs"
    mode => "read"
    start_position => "beginning"
    ignore_older => 86400
    close_older => 86400
    sincedb_path => "/usr/yogehada/var/.sincedb"
    sincedb_clean_after => 2.5
    tags => ["logs"]
  }
}

  1. Tail mode is useful for tailing log files to which a process is actively writing data. If you have files that have already been written out in full and you just want to read each entire file, use read mode.

  2. In tail mode, when Logstash first sees a file, by default it seeks to the end and just listens for new data; it ignores any data already in the file. start_position => beginning tells it to read the file from the beginning. This only applies the first time Logstash sees the file. In read mode it is ignored (as is close_older).

  3. In read mode, when Logstash has finished reading a file, by default it deletes it. Once all the files have been deleted, they have all been read. Alternatively, if you tell it to log the file names it has read, you can check that list for completeness (sketched below).
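
For reference, a minimal read-mode input along those lines might look like the sketch below. The paths here are placeholders, not taken from your setup.

input {
  file {
    path => "/some/download/dir/**/*"                            # files that are already fully written
    mode => "read"                                               # read each file once, from start to finish
    # start_position and close_older are ignored in read mode
    file_completed_action => "log"                               # keep the files and log their names instead of deleting them
    file_completed_log_path => "/some/dir/files_processed.log"
    sincedb_path => "/some/dir/.sincedb"                         # remembers which files have already been read
  }
}

The default file_completed_action is "delete", so an empty directory means everything has been read; with "log" (or "log_and_delete") each finished path is appended to file_completed_log_path, which you can compare against the day's download list.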

Thanks for the reply, @Badger.
So now I am convinced that I should be using the "read" mode.

I killed the Logstash process and tried to start it again, but now it doesn't read any files.
I tried removing ignore_older, but that didn't help.
This is what I see in the debug logs:

[2019-05-14T19:38:25,763][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}
[2019-05-14T19:38:26,218][DEBUG][logstash.outputs.file ] Starting stale files cleanup cycle {:files=>{}}
[2019-05-14T19:38:26,218][DEBUG][logstash.outputs.file ] 0 stale files found {:inactive_files=>{}}
[2019-05-14T19:38:26,646][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:27,370][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu
[2019-05-14T19:38:27,635][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[2019-05-14T19:38:28,134][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2019-05-14T19:38:28,134][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2019-05-14T19:38:28,646][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:28,762][DEBUG][logstash.config.source.local.configpathloader] Skipping the following files while reading config since they don't match the specified glob pattern {:files=>["/usr/share/logstash/CONTRIBUTORS", "/usr/share/logstash/Gemfile", "/usr/share/logstash/Gemfile.lock", "/usr/share/logstash/LICENSE.txt", "/usr/share/logstash/NOTICE.TXT", "/usr/share/logstash/bin", "/usr/share/logstash/config", "/usr/share/logstash/data", "/usr/share/logstash/lib", "/usr/share/logstash/logs", "/usr/share/logstash/logstash-core", "/usr/share/logstash/logstash-core-plugin-api", "/usr/share/logstash/modules", "/usr/share/logstash/tools", "/usr/share/logstash/vendor", "/usr/share/logstash/x-pack"]}
[2019-05-14T19:38:28,762][DEBUG][logstash.config.source.local.configpathloader] Reading config file {:config_file=>"/usr/share/logstash/LogStash.conf"}
[2019-05-14T19:38:28,764][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}
[2019-05-14T19:38:30,647][DEBUG][logstash.outputs.file ] Starting flush cycle
[2019-05-14T19:38:31,762][DEBUG][logstash.config.source.local.configpathloader] Skipping the following files while reading config since they don't match the specified glob pattern {:files=>["/usr/share/logstash/CONTRIBUTORS", "/usr/share/logstash/Gemfile", "/usr/share/logstash/Gemfile.lock", "/usr/share/logstash/LICENSE.txt", "/usr/share/logstash/NOTICE.TXT", "/usr/share/logstash/bin", "/usr/share/logstash/config", "/usr/share/logstash/data", "/usr/share/logstash/lib", "/usr/share/logstash/logs", "/usr/share/logstash/logstash-core", "/usr/share/logstash/logstash-core-plugin-api", "/usr/share/logstash/modules", "/usr/share/logstash/tools", "/usr/share/logstash/vendor", "/usr/share/logstash/x-pack"]}
[2019-05-14T19:38:31,762][DEBUG][logstash.config.source.local.configpathloader] Reading config file {:config_file=>"/usr/share/logstash/LogStash.conf"}
[2019-05-14T19:38:31,764][DEBUG][logstash.agent ] Converging pipelines state {:actions_count=>0}

No events are being processed, and I don't see any errors in the logs either. It looks like some kind of infinite loop. I tried restarting Logstash, but that didn't help.

Also, I deleted the old .sincedb file before starting Logstash.
This is how my LogStash.conf looks now:

input {
  file {
    path => "/local/kdmlogs/*/*/*"
    max_open_files => 50000
    exclude => "*.s3logs"
    mode => "read"
    file_completed_action => "log"
    file_completed_log_path => "/local/metric/files_processed"
    start_position => "beginning"
    #ignore_older => 86400
    sincedb_path => "/usr/yogehada/var/.sincedb"
    sincedb_clean_after => 2.5
    tags => ["logs"]
  }
}
filter {
  if "logs" in [tags] {
    grok {
      id => "parse_log_line"
      match => {
        "message" => "%{YEAR:year}%{MONTHNUM:month}%{MONTHDAY:date}:%{HOUR}%{MINUTE}(%{SECOND}) %{SYSLOGPROG}:%{GREEDYDATA:syslog_message}"
      }
    }
    grok {
      id => "parse_filename"
      match => {
        "path" => "/local/kdmlogs/*/*/%{GREEDYDATA}#%{GREEDYDATA:device_type}#%{GREEDYDATA:software_version}#%{GREEDYDATA:dsn}#%{GREEDYDATA:s3key}"
      }
    }
    # Remove fields we don't want (to save space)
    mutate {
      remove_field => ["message", "path", "host"]
    }
  }
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  if "_grokparsefailure" not in [tags] and "logs" in [tags] {
    amazon_es {
      hosts => ["${HOST}"]
      region => "us-east-1"
      aws_access_key_id => "${ACCESSKEY}"
      aws_secret_access_key => "${SECRET}"
      index => "sys_logs%{+YYYY.MM.dd}"
    }
  }
  if "metric" in [tags] {
    file {
      path => "/local/metric/test-metric.txt"
      codec => line {
        format => "rate: %{[events][rate_1m]}"
      }
    }
  }
}

If you have files that match the pattern "/local/kdmlogs/*/*/*" but whose filenames do not match "*.s3logs", then you should see them being read and their paths added to "/local/metric/files_processed".

Is it possible you meant "/local/kdmlogs/**/*" (i.e. a recursive descent)?
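
To illustrate the difference (only the path option is shown; this is just a sketch):

path => "/local/kdmlogs/*/*/*"   # matches files exactly two directory levels below /local/kdmlogs
path => "/local/kdmlogs/**/*"    # matches files at any depth below /local/kdmlogs (recursive descent)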

If you still do not get anything, set log.level to trace (through either the command line, logstash.yml, or, specifically for filewatch, as described in this post).

Hi @Badger ,

My actual log files are present under "/local/kdmlogs/*/*/".
There are two levels of subdirectories under /local/kdmlogs. It was working perfectly fine before this. Let me enable trace-level logging.

After enabling trace-level logging, this is what I see in the logs:

[2019-05-14T21:02:46,634][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341672359#726#Kindle 5.11.1 (3527990035)#9804d1903e4064994ed60372c2ef4e21#4a25d%2F1557657288%2F9804d1903e4064994ed60372c2ef4e21.1557657269993:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='333254', last_stat_size='333254', file_open?='false', @initial=true, @sincedb_key='24775306 0 64768'>"}
[2019-05-14T21:02:46,634][TRACE][filewatch.sincedbcollection] associate: finding {"inode"=>"24775306", "path"=>"/local/kdmlogs/2019051319/91/9223351846341672359#726#Kindle 5.11.1 (3527990035)#9804d1903e4064994ed60372c2ef4e21#4a25d%2F1557657288%2F9804d1903e4064994ed60372c2ef4e21.1557657269993:DET:UPLOAD"}
[2019-05-14T21:02:46,634][TRACE][filewatch.sincedbcollection] associate: unmatched
[2019-05-14T21:02:46,635][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341684579#726#Kindle 5.11.1 (3527990035)#add089b1d9d0d50c67a262050c74cba0#18f20%2F1557652348%2Fadd089b1d9d0d50c67a262050c74cba0.1557652324692:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='710', last_stat_size='710', file_open?='false', @initial=true, @sincedb_key='24775349 0 64768'>"}
[2019-05-14T21:02:46,635][TRACE][filewatch.sincedbcollection] associate: finding {"inode"=>"24775349", "path"=>"/local/kdmlogs/2019051319/91/9223351846341684579#726#Kindle 5.11.1 (3527990035)#add089b1d9d0d50c67a262050c74cba0#18f20%2F1557652348%2Fadd089b1d9d0d50c67a262050c74cba0.1557652324692:DET:UPLOAD"}
[2019-05-14T21:02:46,635][TRACE][filewatch.sincedbcollection] associate: unmatched
[2019-05-14T21:02:46,636][TRACE][filewatch.discoverer ] discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='9223351846341704272#726#Kindle 5.11.1 (3527990035)#853ea75b6a45313932cb2a43c59f209e#49e2f%2F1557645334%2F853ea75b6a45313932cb2a43c59f209e.1557645308668:DET:UPLOAD', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='227427', last_stat_size='227427', file_open?='false', @initial=true, @sincedb_key='24776898 0 64768'>"}

I am not sure what this actually means.
I presume it discovers the file, looks it up in the sincedb collection, doesn't find it, and marks it as unmatched.

Hi @Badger, did you find time to look at the log above?

Thanks.

I did, but I do not understand the filewatch code well enough to interpret it. It is finding the files and knows they are not zero length, but it does not appear to start reading them. I do not understand how that could happen.

An update on this: after around 15 hours, Logstash started reading some files.
However, the event rate I captured from the "metrics" plugin was too low, around 2,000 events/sec; I usually get an average of 25,000 events/sec.

It stopped again after an hour. It didn't read all the files from the subdirectory.

Unfortunately, there are not enough components logging at TRACE level.

I need as many filewatch.readmode.* components as possible logging at TRACE.

What is the output of curl -XGET 'localhost:9600/_node/logging?pretty'?
