Help with conf file to process log with binary data

sconrod · December 7, 2017, 6:38pm

Hi, I have log files that contain some binary data. For now I just want to try to ingest them using stdin and build a grok filter later. However, logstash is not ingesting the first log file; there is no error, it just hangs.

My conf file:

input {
file {
path => "/opt/sample-data/xx-Logs-v1/*.log"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}

output {
elasticsearch {
hosts => "http://10.0.2.15:9200"
index => "xx-logs-v1.0"
}
stdout {}
}

==================================================

Sample Log File Content:

Mon 11/13/2017 15:06:21.54 xx_ver_xxx 1710

xx_ver_xxx 1710

processing C:\opt\xxBuild\1710\xx_xx_matched.dat vs C:\opt\xxBuild\1710\xx_xx_matched.dat.
C:\opt\xxBuild\1710\xx_xx_matched.dat - Opened
C:\opt\xxBuild\1710\xx_xx_matched.dat - Opened
...not yet at EOF for C:\opt\xxBuild\1709\xx_xx_matched.dat... 0 recs left over.

Done (0 00:00:13).
198830419 recs read from C:\opt\xxBuild\1710\xx_xx_matched.dat.
198595230 recs read from C:\opt\xxBuild\1709\xx_xx_matched.dat.
198611163 recs matched Z11
124531056 of these were both MF == 1.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 313746 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Badger · December 8, 2017, 2:11am

@sconrod, if you want to debug a logstash input or filter, then do not send the data to elasticsearch. Set your output to

output { stdout { codec => rubydebug } }

Review the output from logstash (do not try to review the input to elasticsearch using Kibana). Once you have the right data in the right fields, with the right types, in the logstash output, then you can start putting it into ES. (When I say types, as an example, if you need to do mutate/convert to make something an integer, make sure it looks like

"someField" => 42,

in the rubydebug output and not

"someField" => "42",

which would indicate it is a string. It looks like a trivial difference, but it is not, since it avoids mapping conflicts as you do incremental debugging.)

The file input expects newline delimited files. If you literally have binary data in your logs, that probably is not going to work well. If you have multi-line hex dumps of binary data in your logs it might work, or perhaps a multiline input codec would be better. Without seeing more detail on the input, and especially the rubydebug output, it is hard to tell.
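As a hedged illustration of the type check described above: `someField` is the placeholder name used in this post, and the `mutate`/`convert` step is only needed if the field arrives as a string. A debugging pipeline along these lines might look like this:

```
filter {
  mutate {
    # Convert the string "42" into the integer 42.
    convert => { "someField" => "integer" }
  }
}

output {
  # Print every event to the console; nothing is sent to Elasticsearch yet.
  stdout { codec => rubydebug }
}
```

If the conversion worked, the rubydebug output shows the value without quotes.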

sconrod · December 11, 2017, 7:59pm

Hi Badger, I did try that, but it never processes the file. It just hangs forever with no error. This does not happen with any other log type, only with the log that contains binary data.

sconrod · December 18, 2017, 7:55pm

Hi Badger, I got the log ingested from standard in.

Now I want to just grok out several fields of the data and not see the rest.

Here are the lines from the log that I am interested in:

Mon 11/13/2017 15:06:21.54 s1_ver_cmp 1710

s1_ver_cmp 1710

198830419 recs read from D:\data\s1Build\1710\data_s1_matched.dat.
198595230 recs read from D:\data\s1Build\1709\data_s1_matched.dat.
198611163 recs matched Z11
124531056 of these were both MF == 1.

Percent of data exact matches that changed s1 code: 0.000 %
Percent of data exact matches that stayed the same s1 code: 100.000 %

Here is my mapping:
PUT e1-logs-v4
{
"mappings": {
"doc": {
"properties": {
"Name":{
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"Number": { "type": "integer","ignore_malformed": true},
"FileDate": { "type": "date" },
"Path": { "type": "text" },
"filename": {"type": "keyword"}
}
}
}
}

I would like to grok out as follows to the mappings:

Mon 11/13/2017 15:06:21.54
grok to date in YYYYMMDD format

s1_ver_cmp 1710
grok to filename

198830419 recs read from D:\data\s1Build\1710\data_s1_matched.dat.
grok to new field that is aggregatable: recs_read

198595230 recs read from D:\data\s1Build\1709\data_s1_matched.dat.
grok the number to the number field
grok the recs read from to the name field

198611163 recs matched Z11
grok to the field: number and recs matched z11 to name
124531056 of these were both MF == 1.

Percent of data exact matches that changed s1 code: 0.000 %
grok the percentage to the number field in a percentage format Or create a new field "percent"

Percent of data exact matches that stayed the same s1 code: 100.000 %
same as last one

Badger · December 18, 2017, 10:34pm

@sconrod Firstly, I just checked in for the first time in a week, and had that not been within an hour or so of you posting I would never have seen your reply. If you start the post with an @ mention then I get a notification of it, and I would have been notified of your reply even if I had been away for another week.

By default a stdin input creates a separate event for each line of input. Thus "Mon 11/13/2017 15:06:21.54" and "198611163 recs matched Z11" are separate events, which after processing would be indexed as separate documents. I think it is more likely you want the entire multi-line log entry treated as a single event. For that you need to find a regexp that tells the logstash input when to start a new event. Does every new event, and only a new event, have a date at the start of a line?

Badger · December 18, 2017, 11:37pm

Assuming they do, the first thing is to configure a multiline codec on your stdin input. That will look something like this

input {
  stdin {
    codec => multiline {
      pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
      negate => true
      what => previous
    }
  }
}

output { stdout { codec => rubydebug } }

That says that any line that does NOT match the pattern is appended to the previous event, so a new event starts every time logstash sees a line that does match the pattern (one that begins with a day name). Note that this means that if you just feed one event to logstash it cannot consume it, because it has not seen the start of the second event, which is what tells it the first event is all there.

Once you have that working your events will have a message field that looks like this

 "message" => "Mon 11/13/2017 15:06:21.54 s1_ver_cmp 1710\n\ns1_ver_cmp 1710\n\n198830419 recs read from D:\\data\\s1Build\\1710\\data_s1_matched.dat.\n198595230 recs read from D:\\data\\s1Build\\1709\\data_s1_matched.dat.\n198611163 recs matched Z11\n124531056 of these were both MF == 1.\n\nPercent of data exact matches that changed s1 code: 0.000 %\nPercent of data exact matches that stayed the same s1 code: 100.000 %",

You could try to build a single grok filter that matches that, but it will be extremely fragile as you edit it and you will drive yourself nuts doing it. So build grok filters that match each line. There is no magic quoting syntax for a newline inside a grok pattern; just split the pattern across lines. So for example, we can pull out the percent of matches that stayed the same using:

grok {
    match => { "message" => [ "
Percent of data exact matches that stayed the same s1 code: %{NUMBER:percentSame:float} %" ] }
}

The two lines that show the number of records read would both match the same pattern, so we need a pattern that combines them. If you remove the third %{INT} you will find out why I added it

  grok {
    match => { "message" => [ "
%{INT:firstFileRecords:int} recs read from %{PATH:firstFile}\.
%{INT:secondFileRecords:int} recs read from %{PATH:secondFile}\.
%{INT}" ] }
  }

The date is the only one that we can anchor using ^

grok {
  match => { "message" => [ "^%{DAY} %{DATE:date} %{TIME:time}\.%{INT:subsecond}" ] }
  add_field => { "stamp" => "%{date} %{time}.%{subsecond}" }
}

Then you can use a date filter to parse the stamp field into the timestamp, then a mutate/remove_field to get rid of the intermediates (once you know they are working properly).
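On the note earlier in this post that a lone event is never emitted because logstash is still waiting for the start of the next one: the multiline codec has an `auto_flush_interval` option that flushes a pending event after a period with no new lines. A minimal sketch, assuming a file input; the path and the 5-second value are arbitrary placeholders, and whether the option also helps on stdin is not verified here:

```
input {
  file {
    # Placeholder path; point this at your own log files.
    path => "/path/to/your/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
      negate => true
      what => previous
      # Emit the buffered event if no new line arrives for 5 seconds.
      auto_flush_interval => 5
    }
  }
}
```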

sconrod · December 19, 2017, 9:45pm

@Badger
Hi, thank you. I have tried this, and the first step (setting up the multiline codec) worked. But when I added the remaining grok filters I got a parse error, even though I verified that the YAML is valid with a YAML linter. I believe it is an issue with my syntax; I would greatly appreciate you taking a look:

My conf file:
input {
stdin {
codec => multiline {
pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
negate => true
what => previous
}
}
file {
path => "/opt/sample-data/E1-logs-v1/*.log"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}

filter {

grok {
match => ["path", "/opt/sample-data/E1-logs-v1/%{DATA:filename:keyword}.log"] }

grok {
match => [ "message","Percent of data exact matches that stayed the same s1 code:", %{NUMBER:percentSame:float} %" ] }

grok {
match => [ "message","%{INT:firstFileRecords:int} recs read from %{PATH:firstFile}.%{INT:secondFileRecords:int} recs read from %{PATH:secondFile}.%{INT}" ] }

grok {
match => [ "message", "^%{DAY} %{DATE:date} %{TIME:time}.%{INT:subsecond}" ]
add_field => ["stamp","%{date} %{time}.%{subsecond}" ]
}
}

output {
elasticsearch {
hosts => "http://10.0.2.15:9200"
index => "e1-logs-v7"
}

stdout {codec => rubydebug }
}

[ERROR] 2017-12-19 13:43:07.685 [Ruby-0-Thread-1: /usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/task.rb:22] agent - Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:main, :exception=>"LogStash::ConfigurationError", :message=>"Expected one of #, ", ', -, [, { at line 23, column 90 (byte 446) after filter {\n\ngrok {\n match => ["path", "/opt/sample-data/E1-logs-v1/%{DATA:filename:keyword}.log"] }\n\n\n\ngrok {\n match => [ "message", "Percent of data exact matches that stayed the same s1 code:", ", :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:42:in `compile_ast'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:50:in `compile_imperative'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:54:in `compile_graph'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:12:in `block in compile_sources'", "org/jruby/RubyArray.java:2486:in `map'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:11:in `compile_sources'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:107:in `compile_lir'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:49:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:215:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:35:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:335:in `block in converge_state'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:141:in `with_pipelines'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:332:in `block in converge_state'", "org/jruby/RubyArray.java:1734:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:319:in `converge_state'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:166:in `block in converge_state_and_update'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:141:in `with_pipelines'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:164:in `converge_state_and_update'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:90:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:362:in `block in execute'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/task.rb:24:in `block in initialize'"]}

Badger · December 20, 2017, 1:25pm

@sconrod Change

match => [ "message","Percent of data exact matches that stayed the same s1 code:", %{NUMBER:percentSame:float} %" ] }

to

match => [ "message","Percent of data exact matches that stayed the same s1 code: %{NUMBER:percentSame:float} %" ] }

When posting, if you wrap your configs in <pre> and </pre> they are easier for people to read.

sconrod · December 20, 2017, 11:21pm

@Badger
It looks like I got these two grok filters working, but the rest are not working yet.
Thank you for these two working ones. I am plugging away on the others and will post an update if I cannot get them going in another day. Thanks for all your help. Learning more.

grok {
match => ["path", "/opt/sample-data/E1-logs-v1/%{DATA:filename:keyword}.log"] }

grok {
match => { "message" => [ "Percent of MAF exact matches that stayed the same E1 code: %{NUMBER:percentSame:float} %" ] }
}

However, what is super weird is that when I add in the third one, it breaks all of them. Below are all three together:

grok {
match => ["path", "/opt/sample-data/E1-logs-v1/%{DATA:filename:keyword}.log"] }

grok {
match => { "message" => [ "Percent of MAF exact matches that stayed the same E1 code: %{NUMBER:percentSame:float} %" ] }
}

grok {
match => { "message" => [ "Percent of MAF exact matches that changed E1 code: %{NUMBER:percentChanged:float} %" ] }
}

Here is the error:

{
"path" => "/opt/sample-data/E1-logs-v1/e1_ver_cmp_1710_edit.log",
"@timestamp" => 2017-12-21T00:56:37.831Z,
"filename" => "e1_ver_cmp_1710_edit",
"@version" => "1",
"host" => "ubuntu-16",
"message" => "Percent of MAF exact matches that changed E1 code: 0.000 %",
"tags" => [
[0] "_grokparsefailure"
]
}
{
"path" => "/opt/sample-data/E1-logs-v1/e1_ver_cmp_1710_edit.log",
"percentSame" => 100.0,
"@timestamp" => 2017-12-21T00:56:37.848Z,
"filename" => "e1_ver_cmp_1710_edit",
"@version" => "1",
"host" => "ubuntu-16",
"message" => "Percent of MAF exact matches that stayed the same E1 code: 100.000 %",
"tags" => [
[0] "_grokparsefailure"
]
}

Badger · December 21, 2017, 1:48pm

You have the multiline codec for stdin, but not for the file input. So every line is being parsed independently for the files, and they will always fail to match one of the groks if you have more than one. Lines from stdin will not have a path field, so they will get a grok parse failure for that one.
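To make it easier to see which grok is failing, one option is to guard the path grok with a conditional and give each grok its own failure tag. This is only a sketch: the tag names are made up for illustration, and the path pattern assumes the field should be called `filename` as in your config.

```
filter {
  # Only events from the file input carry a "path" field.
  if [path] {
    grok {
      match => { "path" => "/(?<filename>[^/]+)\.log$" }
      tag_on_failure => ["_grok_path_failed"]
    }
  }
  grok {
    match => { "message" => "stayed the same E1 code: %{NUMBER:percentSame:float} %" }
    tag_on_failure => ["_grok_same_failed"]
  }
  grok {
    match => { "message" => "changed E1 code: %{NUMBER:percentChanged:float} %" }
    tag_on_failure => ["_grok_changed_failed"]
  }
}
```

With distinct tags, the rubydebug output shows which pattern did not match a given event instead of a single generic _grokparsefailure.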

sconrod · December 21, 2017, 6:46pm

@Badger

Hi, I have that; I should have pasted the entire config file. Here it is:

 input {
stdin {
codec => multiline {
pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
negate => true
what => previous
}
}
  file {
    path => "/opt/sample-data/E1-logs-v1/*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {

grok {
match => ["path", "/opt/sample-data/E1-logs-v1/%{DATA:filename:keyword}.log"] }

grok {
match => [ "message","Percent of MAF exact matches that stayed the same E1 code: %{NUMBER:percentSame:float} %" ] }

grok {
match => [ "message","Percent of MAF exact matches that changed E1 code: %{NUMBER:percentChanged:float} %" ] }

grok {
    match => { "message" => [ "%{INT:firstFileRecords:int} recs read from %{PATH:firstFile}\.%{INT:secondFileRecords:int} recs read from %{PATH:secondFile}\.%{INT}" ] }
  }

grok {
match => { "message" => [ "^%{DAY} %{DATE:date} %{TIME:time}\.%{INT:subsecond}" ] }
add_field => { "stamp" => "%{date} %{time}.%{subsecond}" }
}

}

output {
stdout {codec => rubydebug }
}

Badger · December 21, 2017, 8:37pm

@sconrod OK, so let's get rid of the stdin input, and just use the file input. That needs a multiline codec added (the one on the stdin input does not impact the file input).

input {
  file {
    codec => multiline {
      pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
      negate => true
      what => previous
    }
    path => "/tmp/X*.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  grok { match => { "path" => "/(?<filename>[^/]+).log" } }
  grok { match => { "message" => "^%{DAY} %{DATE:date} %{TIME:time}\.%{INT:subsecond}" } }
  grok { match => { "message" => "Percent of %{WORD} exact matches that stayed the same %{WORD} code: %{NUMBER:percentSame:float} %" } }
  grok { match => { "message" => "Percent of %{WORD} exact matches that changed %{WORD} code: %{NUMBER:percentChanged:float} %" } }
  grok { match => { "message" => "%{INT:firstFileRecords:int} recs read from %{PATH:firstFile}\.
%{INT:secondFileRecords:int} recs read from %{PATH:secondFile}\.
%{INT}" } }
  mutate { add_field => { "stamp" => "%{date} %{time}.%{subsecond}" } }
  date {
    match => [ "stamp", "MM/dd/yyyy HH:mm:ss.SS" ]
    timezone => "Asia/Baku"  
  }
  mutate { remove_field => [ "date",  "time", "subsecond", "stamp" ] }
}

It appears that once a grok fails, subsequent groks do not do anything, so keep the working ones at the top of the filter, and tune the one at the bottom. Yeah, I know you are probably not in Baku; it's a placeholder, so substitute your own timezone.
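A possible alternative while tuning patterns, not taken from the config above: grok accepts a list of patterns for one field, and its `break_on_match` option (default true) controls whether it stops at the first pattern that matches. A sketch, reusing the single-line patterns from this post, with `break_on_match => false` so every pattern is tried against the multi-line message:

```
filter {
  grok {
    # Try every pattern in the list instead of stopping at the first match.
    break_on_match => false
    match => {
      "message" => [
        "^%{DAY} %{DATE:date} %{TIME:time}\.%{INT:subsecond}",
        "Percent of %{WORD} exact matches that stayed the same %{WORD} code: %{NUMBER:percentSame:float} %",
        "Percent of %{WORD} exact matches that changed %{WORD} code: %{NUMBER:percentChanged:float} %"
      ]
    }
  }
}
```

Check the result in the rubydebug output before relying on it; this has not been run against your logs.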

sconrod · December 22, 2017, 12:55am

Badger:

input {
file {
codec => multiline {
pattern => "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
negate => true
what => previous
}
path => "/tmp/X*.log"
start_position => "beginning"
sincedb_path => "/dev/null"
}
}

filter {
grok { match => { "path" => "/(?<filename>[^/]+).log" } }
grok { match => { "message" => "^%{DAY} %{DATE:date} %{TIME:time}\.%{INT:subsecond}" } }
grok { match => { "message" => "Percent of %{WORD} exact matches that stayed the same %{WORD} code: %{NUMBER:percentSame:float} %" } }
grok { match => { "message" => "Percent of %{WORD} exact matches that changed %{WORD} code: %{NUMBER:percentChanged:float} %" } }
grok { match => { "message" => "%{INT:firstFileRecords:int} recs read from %{PATH:firstFile}\.
%{INT:secondFileRecords:int} recs read from %{PATH:secondFile}\.
%{INT}" } }
mutate { add_field => { "stamp" => "%{date} %{time}.%{subsecond}" } }
date {
match => [ "stamp", "MM/dd/yyyy HH:mm:ss.SS" ]
timezone => "Asia/Baku"
}
mutate { remove_field => [ "date", "time", "subsecond", "stamp" ] }
}

@Badger
Would it be better to use filebeats for this?

Badger · December 22, 2017, 1:43pm

You are going to end up doing the parsing in logstash anyway, so how you get the files off disk does not make much difference.

sconrod · December 22, 2017, 10:07pm

@Badger
Still not working. I would like to start from scratch and simplify this.

Say I only want to grok these two lines and nothing else:

Percent of MAF exact matches that changed E1 code: 0.000 %
Percent of MAF exact matches that stayed the same E1 code: 100.000 %

So that I have two fields added:

E1-MAF-MATCH
E1-MAF-CHANGED
Both of these will be keyword,

and the 100.00 will be an integer.

Here is my try:

grok{
match => ["message" => "%{DATA:log-message} "changed E1 code:" %{NUMBER:Number:INT}"]
add_field => ["MAF-E1-CHANGED:keyword", "%{log-message}"]
}

grok{
match => ["message" => "%{DATA:log-message} "stayed the same E1 code:" %{NUMBER:Number:INT}"]
add_field => ["MAF-E1-SAME:keyword", "%{log-message}"]
}

Badger · December 22, 2017, 11:06pm

@sconrod

grok{
  match => ["message" => "%{DATA:log-message} "changed E1 code:" %{NUMBER:Number:INT}"]
  add_field => ["MAF-E1-CHANGED:keyword", "%{log-message}"]
}

You can do either 'match => [ "fieldName", "pattern" ]' or 'match => { "fieldName" => "pattern"}'. Either will work, but you are mixing the two forms, which does not work.
The pattern should be one quoted string, with no quotes inside it. Remove the quotes inside the pattern:

"%{DATA:log-message} "changed E1 code:" %{NUMBER:Number:INT}"

should be

"%{DATA:log-message} changed E1 code: %{NUMBER:Number:INT}"

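Appending :keyword to the field name probably does not do what you want it to do. Grok's type suffix only converts to int or float; keyword is a mapping type that is set in Elasticsearch.
'%{NUMBER:Number:INT}' should be '%{NUMBER:Number:int}' (or even '%{NUMBER:Number:float}').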
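Putting those corrections together, a hedged sketch of the two filters, one in each `match` form. The field names are the ones from your attempt, and `float` is used so that 100.000 is not truncated. It assumes each line is its own event; if both lines sit in one multi-line event, both groks would write to `Number` and `log-message`.

```
filter {
  # Hash form of match: { "fieldName" => "pattern" }
  grok {
    match => { "message" => "%{DATA:log-message} changed E1 code: %{NUMBER:Number:float}" }
    add_field => { "MAF-E1-CHANGED" => "%{log-message}" }
  }
  # Array form of match: [ "fieldName", "pattern" ]
  grok {
    match => [ "message", "%{DATA:log-message} stayed the same E1 code: %{NUMBER:Number:float}" ]
    add_field => { "MAF-E1-SAME" => "%{log-message}" }
  }
}
```

The add_field setting is only applied when the grok in the same block matches, so each event gets at most the field for its own line.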

system · January 19, 2018, 11:06pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
