I am parsing some sysstat (Linux sar command) files. The sysstat files start with a header giving the Linux kernel version, hostname, date, and architecture. I was able to parse out what I wanted from that header with this grok filter:
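Roughly something like this (a sketch only; it assumes the usual sar header form `Linux <kernel> (<hostname>)  <MM/DD/YYYY>  _<arch>_  (<N> CPU)`, and the field names are just illustrative):

grok {
  match => {
    "message" => "Linux %{NOTSPACE:kernel_version} \(%{HOSTNAME:stat_hostname}\)\s+%{DATE_US:stat_date}\s+_%{NOTSPACE:stat_arch}_"
  }
}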
After that, I can collect the time of each entry and the stats reported with this grok filter:
grok {
  match => {
    "message" => "%{TIME:stat_time}\s+%{NUMBER:stat_b_tps}\s+%{NUMBER:stat_b_rtps}\s+%{NUMBER:stat_b_wtps}\s+%{NUMBER:stat_b_bread}\s+%{NUMBER:stat_b_bwrtn}%{GREEDYDATA:remaining_stat3_message}"
  }
}
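For reference, that pattern is aimed at `sar -b` style lines in 24-hour time, something like this (values made up for illustration):

12:10:01         1.23      0.45      0.78     12.34     56.78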
What I can't figure out is how to build a timestamp for each entry, since the date and time are on separate lines and the date is only listed once. Can I store the date in some sort of variable to reference later?
For now I just used Python to set an environment variable by reading the file, and then Logstash reads that environment variable.
You don't really need Python for this; I just already had a Python script opening a tar file from a customer's system.
Python method:

import os

with open("path_to_file_with_header", 'r') as sysstat_file:
    first_line = sysstat_file.readline()
    # grab the 4th column; split() starts counting at 0
    sysstat_date = first_line.split()[3]
    os.environ["LOGSTASH_SYSSTAT_DATE"] = sysstat_date
    # print to check the date is what we thought
    print(os.environ["LOGSTASH_SYSSTAT_DATE"])
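One caveat with the Python route: os.environ only affects the current process and whatever it spawns afterwards, so Logstash has to be started from that same Python script for the variable to be visible. A minimal sketch of what I mean (paths and config name are placeholders):

import subprocess

# Logstash inherits this process's environment, including
# LOGSTASH_SYSSTAT_DATE set above, and reads the file on stdin.
with open("path_to_file_with_header") as data:
    subprocess.run(["bin/logstash", "-f", "logstash.conf"], stdin=data)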
Bash method:

# grabs the 4th column; in awk, $0 is the entire line and $1 is the first column
export LOGSTASH_SYSSTAT_DATE="$(head -n 1 path_to_file_with_header | awk '{print $4}')"
# echo to check that it outputs what we expect
echo "$LOGSTASH_SYSSTAT_DATE"
Then within the logstash.conf file, do something like this:
input {
  stdin {
    tags => [ "${LOGSTASH_TAG:null}" ]
  }
}
filter {
  grok {
    match => { "message" => "%{TIME:stat_time}%{GREEDYDATA:remaining_stat_message}" }
  }
  mutate {
    add_field => {
      "sys_log_timestamp" => "${LOGSTASH_SYSSTAT_DATE:null} %{stat_time}"
    }
  }
  date {
    match => [ "sys_log_timestamp", "MM/dd/yy HH:mm:ss" ]
    target => "@timestamp"
  }
  # If the date can't be parsed, I don't want it in my Elasticsearch, so I just drop it.
  if "_dateparsefailure" in [tags] {
    drop {}
  }
}
output {
  stdout {
    codec => rubydebug
  }
  # elasticsearch {
  #   hosts => ["127.0.0.1:9200"]
  #   index => "${LOGSTASH_NAME:customer_investigation}"
  # }
}
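One aside: if your sar files use 12-hour time with an AM/PM column, the patterns above won't line up as written; you'd need to capture the AM/PM marker as well and give the date filter a matching pattern, roughly like this (an untested sketch):

grok {
  match => { "message" => "%{TIME:stat_time} %{WORD:stat_ampm}%{GREEDYDATA:remaining_stat_message}" }
}
mutate {
  add_field => { "sys_log_timestamp" => "${LOGSTASH_SYSSTAT_DATE:null} %{stat_time} %{stat_ampm}" }
}
date {
  # 'hh' is the 12-hour clock hour and 'a' is the AM/PM marker in the date filter's patterns
  match => [ "sys_log_timestamp", "MM/dd/yy hh:mm:ss a" ]
  target => "@timestamp"
}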
Oh and lastly, to read a file through once as well as to get the environment variables to show up in Logstash, run it like so:
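Roughly like this (paths are placeholders; depending on your Logstash version you may need to explicitly enable environment variable substitution in the config, if I remember right older 2.x releases had it behind an --allow-env flag, while newer releases do it by default):

export LOGSTASH_SYSSTAT_DATE="$(head -n 1 path_to_file_with_header | awk '{print $4}')"
# with the stdin input, Logstash reads the piped file once and shuts down when stdin closes
cat path_to_sysstat_file | bin/logstash -f logstash.conf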
This still feels like the 'wrong' way to solve this problem, or at least not very idiomatic for the ELK toolset. I would prefer to be able to set the variable within Logstash itself. I saw some earlier solutions that included Ruby code within the Logstash config that set a Ruby variable, but I haven't dealt with much Ruby and wasn't following how to repurpose their snippets for my scenario.
Ah, I see what you mean. Likewise, I wanted to avoid doing any pre-processing, because I feel that Logstash should be able to manage to process arbitrary input somehow; if it can't, then there's something wrong with its design.
My own issue is here:
It seems to me that Logstash struggles with anything that isn't a simple, predictable log format of "timestamp followed by some data". But even my situation does have that format, just not quite in the way Logstash wants it.
Thanks for the info, I'll look into using a cronned sed script or something to preprocess the data.
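For what it's worth, the kind of preprocessing I have in mind is roughly this (I said sed, but an awk one-liner is easier to sketch; untested, paths are placeholders): take the date from the 4th field of the header line and prepend it to every following line before it ever reaches Logstash.

# prepend the header date (4th field of line 1) to every subsequent line
awk 'NR==1 { d=$4; next } { print d, $0 }' path_to_sysstat_file > preprocessed_file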
I am not certain if it's showing up correctly for me, as on the forum it's all one line, but if I quote it, then it does have a header. Anyway, assuming it has a header that starts with ZZZZ and ends with 2016, you would likely want to use the multiline codec so that it folds every line without a ZZZZ onto the previous line that started with ZZZZ and had a date. Then you can just write the regex to pull the date out of that single line.
input {
  stdin {
    codec => multiline {
      # folds all lines between lines starting with ZZZZ into a single line;
      # the \n will still be present in the new 'single' line
      pattern => "^ZZZZ"
      negate => true
      what => "previous"
    }
  }
}
filter {
  mutate {
    # remove \n in message and replace with NEWLINE
    gsub => ["message", "\n", "NEWLINE"]
  }
  grok {
    # your grok to pull the date out of the merged line goes here
  }
}
The reason this doesn't work for my scenario is that the timestamp changes with each line; only the date part is in the header.