How do I create a global date based on a file header

I am parsing some sysstat (Linux sar command) files. The sysstat files start with a header containing the Linux kernel version, hostname, date, and architecture. I was able to parse out what I wanted with this grok filter:

patterns_dir => ["./logstash_patterns"]
match => {
  "message" => "%{STAT_KERNEL:stat_kernel}\(%{HOSTNAME:stat_hostname}\) \t%{DATE_US:stat_date}%{GREEDYDATA:remaining_stat_message} "
}
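(STAT_KERNEL is a custom pattern from my patterns_dir, which I haven't reproduced here; a hypothetical one-line definition along these lines would yield the stat_kernel capture shown in the rubydebug output further down:)

STAT_KERNEL Linux %{NOTSPACE}\s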

After that, I can collect the time of each entry and the reported stats with this grok filter:

match => {
  "message" => "%{TIME:stat_time}\s+%{NUMBER:stat_b_tps}\s+%{NUMBER:stat_b_rtps}\s+%{NUMBER:stat_b_wtps}\s+%{NUMBER:stat_b_bread}\s+%{NUMBER:stat_b_bwrtn}%{GREEDYDATA:remaining_stat3_message}"
}

What I can't figure out is how to build a timestamp for each entry, since the date and time are on separate lines and the date is only listed once. Can I store the date in some sort of variable to reference later?

Here is a snippet of the file I am parsing:

####### sa05-b.out ########
Linux 2.6.32.54-0.79.TDC.1.R.0-default (WAITROSE-1-9) 	09/04/16 	_x86_64_

16:00:02          tps      rtps      wtps   bread/s   bwrtn/s
16:05:01        48.31      9.20     39.11    258.27   1229.09
16:10:01        48.21      9.35     38.86     97.71   1012.97
08:40:01        40.93      9.66     31.27    278.61    988.54
08:45:01        45.21      9.54     35.67    185.97   1530.56
08:50:01        41.37      9.36     32.01    124.09    983.74
08:55:01        47.40      9.27     38.13    123.23   1058.12
09:00:02        40.87      9.47     31.40    216.35    897.70
09:05:01        48.37      9.85     38.52    275.62   1205.39
09:10:01        47.12      9.33     37.79    114.50    967.01
09:15:01        47.33      9.88     37.45    334.40   1277.19
09:20:01        42.01     10.09     31.92    278.22   1158.59
Average:        57.66     18.81     38.85   2348.31   2472.59

Thanks!

This is a snippet of how my current rubydebug output looks for the data I am interested in:

{
         "@version" => "1",
       "@timestamp" => "2016-09-19T17:38:00.941Z",
             "tags" => [
        [0] "sysstat",
        [1] "b"
    ],
             "host" => "WAITROSE-1-9",
      "stat_kernel" => "Linux 2.6.32.54-0.79.TDC.1.R.0-default ",
    "stat_hostname" => "WAITROSE-1-9",
        "stat_date" => "09/06/16"
}
{
        "@version" => "1",
      "@timestamp" => "2016-09-19T17:38:01.718Z",
            "tags" => [
        [0] "sysstat",
        [1] "b"
    ],
            "host" => "%{stat_hostname}",
       "stat_time" => "00:30:01",
      "stat_b_tps" => "75.74",
     "stat_b_rtps" => "51.89",
     "stat_b_wtps" => "23.84",
    "stat_b_bread" => "2592.35",
    "stat_b_bwrtn" => "3440.23"
}

For now I just used Python to set an environment variable after reading the file's header, and then Logstash reads that environment variable.

If you don't mind, could you provide an example? I have a similar issue that could be solved using this method.
Thanks.

So you don't really need Python; I just already had a Python script opening a tar file from a customer's system.

Python method:

import os

with open("path_to_file_with_header", 'r') as sysstat_file:
    first_line = sysstat_file.readline()
    # split()[3] grabs the 4th column, since indexing starts at 0
    sysstat_date = first_line.split()[3]
    os.environ["LOGSTASH_SYSSTAT_DATE"] = sysstat_date
    # print to check the date is what we thought
    print(os.environ["LOGSTASH_SYSSTAT_DATE"])
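One caveat with the Python method: os.environ only affects child processes of the script that sets it, so Logstash has to be launched from that same script for the variable to be visible. A hypothetical way to do that from the end of the script above:

import subprocess

# launch logstash from this process so it inherits LOGSTASH_SYSSTAT_DATE;
# the file is fed on stdin to match the stdin input in logstash.conf
with open("path_to_file_with_header") as stdin_file:
    subprocess.call(["logstash", "--allow-env", "-f", "logstash.conf"], stdin=stdin_file)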

Bash method:

# grabs the 4th column; in awk, $0 is the entire line and $1 is the first column
export LOGSTASH_SYSSTAT_DATE="$(head -n 1 path_to_file_with_header | awk '{print $4}')"
# echo to check that it outputs as expected
echo "$LOGSTASH_SYSSTAT_DATE"

Within the logstash.conf file, do something like this:

input {
  stdin {
    tags => [ "${LOGSTASH_TAG:null}" ]
  }
}
filter {
  grok {
    match => { "message" => "%{TIME:stat_time}%{GREEDYDATA:remaining_stat_message}" }
  }
  mutate {
    add_field => {
      "sys_log_timestamp" => "${LOGSTASH_SYSSTAT_DATE:null} %{stat_time}"
    }
  }
  date {
    match => [ "sys_log_timestamp", "MM/dd/yy HH:mm:ss" ]
    target => "@timestamp"
  }
  # If the date can't be parsed, I don't want it in my Elasticsearch, so I just drop it.
  if "_dateparsefailure" in [tags] {
    drop {}
  }
}
output {
  stdout {
    codec => rubydebug
  }
  # elasticsearch {
  #   hosts => ["127.0.0.1:9200"]
  #   index => "${LOGSTASH_NAME:customer_investigation}"
  # }
}

Oh, and lastly: to read the file through once and to let Logstash see the environment variables, run it like so:

logstash --allow-env -f logstash.conf < path_to_file_with_header

This still feels like the 'wrong' way to solve this problem, or at least not very true to the ELK toolset. I would prefer being able to set the variable from within Logstash itself. I saw some earlier solutions that included Ruby code within the Logstash config to set a Ruby variable, but I haven't dealt with much Ruby and wasn't following how to repurpose their snippets for my scenario.
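For reference, the snippets I saw were roughly this shape (an untested sketch: it caches the header date in a class variable shared across events, so it assumes a single pipeline worker, -w 1, to keep events in file order; the event['field'] syntax is the Logstash 2.x event API):

filter {
  ruby {
    # if this event carries the header date, remember it;
    # otherwise stamp the remembered date onto the event
    code => "
      if event['stat_date']
        @@sysstat_date = event['stat_date']
      elsif defined?(@@sysstat_date)
        event['sys_log_date'] = @@sysstat_date
      end
    "
  }
}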

Ah, I see what you mean. Likewise, I wanted to avoid doing any pre-processing, because I feel that Logstash should be able to process arbitrary input somehow. If it can't, then there's something wrong with its design.

My own issue is here:

It seems to me that Logstash struggles with anything that isn't a simple, predictable log format of

timestamp followed by some data

but even my situation does have that format, just not quite in the way Logstash wants it.

Thanks for the info, I'll look into using a cronned sed script or something to preprocess the data.
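Probably something like this awk one-liner as the preprocessor (an untested sketch, assuming the date sits in the fourth field of the file's first line, like in your header):

# prefix each HH:MM:SS data line with the date taken from the header line
awk 'NR == 1 { d = $4 } /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/ { print d, $0; next } { print }' sa05-b.out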

I am not certain it's showing up correctly for me, since on the forum it's all one line, but if I quote it, it does have a header. Anyway, assuming each record has a header that starts with ZZZZ and ends with 2016, you would likely want to use the multiline codec so that it moves all lines without a ZZZZ onto the previous line that started with ZZZZ and had a date. Then you can just write the regex to pull the date out of the single line.

input {
  stdin {
    codec => multiline {
      # collapses all lines that don't start with ZZZZ onto the previous
      # ZZZZ line; the \n characters are still present in the new 'single' line
      pattern => "^ZZZZ"
      negate => true
      what => "previous"
    }
  }
}
filter {
  mutate {
    # replace each \n in the joined message with the literal token NEWLINE
    gsub => ["message", "\n", "NEWLINE"]
  }
  grok { ... }  # your pattern for the flattened line; sketched below
}
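Then, assuming the ZZZZ header lines look something like ZZZZ,T0001,16:05:01,04-SEP-2016 (a guess on my part based on typical nmon output), the grok stage could pull the date back out of the flattened line:

grok {
  # hypothetical pattern; adjust to the real ZZZZ layout
  match => { "message" => "^ZZZZ,%{WORD:snapshot},%{TIME:stat_time},%{DATA:stat_date}NEWLINE%{GREEDYDATA:stat_body}" }
}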

The reason this doesn't work for my scenario is that the timestamp changes with each line; only the date part is in the header.

For reference, here is how my data looks where I needed to set an environment variable:

sar -u -f sa04

Linux 2.6.32.54-0.79.TDC.1.R.0-default (WAITROSE-1-9) 	09/04/16 	_x86_64_

16:00:02        CPU      %usr     %nice      %sys   %iowait    %steal      %irq     %soft    %guest     %idle
16:05:01        all      0.02      0.00      0.05      0.02      0.00      0.03      0.06      0.00     99.83
16:10:01        all      0.01      0.00      0.04      0.01      0.00      0.03      0.06      0.00     99.85
08:40:01        all      0.01      0.00      0.02      0.03      0.00      0.03      0.05      0.00     99.86
08:45:01        all      0.02      0.00      0.05      0.01      0.00      0.03      0.06      0.00     99.82
08:50:01        all      0.00      0.00      0.02      0.01      0.00      0.03      0.05      0.00     99.88
08:55:01        all      0.16      0.00      0.12      0.04      0.00      0.03      0.06      0.00     99.59
09:00:02        all      0.01      0.00      0.02      0.01      0.00      0.03      0.05      0.00     99.87
09:05:01        all      0.02      0.00      0.04      0.02      0.00      0.03      0.06      0.00     99.84
09:10:01        all      0.01      0.00      0.03      0.02      0.00      0.03      0.05      0.00     99.86
09:15:01        all      0.03      0.00      0.07      0.02      0.00      0.03      0.06      0.00     99.79
09:20:01        all      0.01      0.00      0.02      0.01      0.00      0.03      0.05      0.00     99.87
Average:        all      0.03      0.00      0.06      0.05      0.00      0.03      0.06      0.00     99.78