Logstash 7 GROK Filter plugin 'break_on_match' appears to be broken


#1

Hi,

We have just installed v7 of Logstash and Elasticsearch. Our configuration worked fine in 6.7 of both. What we have noticed is that under 7.0 the 'break_on_match' option no longer appears to be honoured.

An example grok filter that used to work is:

grok {
  named_captures_only => "true"
  patterns_dir => [ "/logstash/patterns" ]
  # will match where the target is reachable
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{PACKET_RETURN_TIME} %{GREEDYDATA:packetTime} ms" }
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
}

Under 7.0, getting the same behaviour requires changing it to:

grok {
  named_captures_only => "true"
  patterns_dir => [ "/logstash/patterns" ]
  # will match where the target is reachable
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{NUMBER:packetTime} ms" }
}
# HACK: due to what appears to be a bug in the LS 7.0 grok filter, check whether we have
# extracted a log time. If we have, the pattern above found a match; if not, attempt to
# match the pattern below.
if [logTime] !~ /.+/ {
  grok {
    named_captures_only => "true"
    patterns_dir => [ "/opt/logstash/logstash-7.0.0/patterns" ]
    # will match where the target is unreachable
    match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
  }
}
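
As an aside, the presence check can also be written with a bare field conditional, which is truthy only when the field exists. A sketch, equivalent here as long as logTime is either absent or non-empty:

```
# try the fallback pattern only when the first grok did not set logTime
if ![logTime] {
  grok {
    named_captures_only => "true"
    patterns_dir => [ "/logstash/patterns" ]
    match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
  }
}
```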

Without the above, the _grokparsefailure tag is appended to the document in ES.

The documentation for the grok filter in 7.0 still says that 'break_on_match' is valid, so this would appear to be a bug.

Anybody else come across this?

Cheers


#2

If you have two match options for the same field, my expectation in Logstash 6 is that the first pattern would be ignored and only the second pattern would be evaluated.


#3

Hi Badger,

Example 'good' and 'bad' log file entries that are processed by the above configuration:
Apr 12 00:01:31 [INFO] Host server1 [10.228.155.28] is reachable via tcp on port 4 (packet return time: 0.332 ms)
Apr 12 00:02:22 [INFO] Host server1 [10.228.155.28] is unreachable

Logstash 6
The first log line matches the first pattern, and with break_on_match the grok filter stops with the following fields populated:
logTime=Apr 12 00:01:31
target=server1
state=reachable
packetTime=0.332

The second log line doesn't match the first pattern, but does match the second pattern, which results in:
logTime=Apr 12 00:02:22
target=server1
state=unreachable

Logstash 7
This results in a _grokparsefailure. Modifying the config to put the if statement checking for the presence of logTime before attempting to match the second pattern appears to work around the parse issue.

So, something has changed between the versions. I have also explicitly added an end-of-line anchor '$' to the second pattern so it will not match a 'good' log line, to no avail.

We also have another section that parses a different log file and relies on the 'break_on_match' behaviour; it also no longer works.

Cheers
Paul


#4

I have to disagree. With this configuration in 6.7

input { generator { count => 1 message => 'Apr 12 00:01:31 [INFO] Host server1 [10.228.155.28] is reachable via tcp on port 4 (packet return time: 0.332 ms)' } }
input { generator { count => 1 message => 'Apr 12 00:02:22 [INFO] Host server1 [10.228.155.28] is unreachable ' } }

filter {
    grok {
        named_captures_only => "true"
        pattern_definitions => { "TCP_TIME" => "([A-Z][a-z]{2} [0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2})" "HOST" => "%{HOSTNAME}" "TARGET" => "\[%{IP}\]" }
        # will match where the target is reachable
        match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{PACKET_RETURN_TIME} %{GREEDYDATA:packetTime} ms" }
        match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
    }
}

both messages are matched only against the second match option. In fact it doesn't complain that PACKET_RETURN_TIME (or XXPACKET_RETURN_TIME if you change it) is undefined, because grok never even compiles the first match option.

I totally agree that the application of _grokparsefailure is broken, but you are not providing a reproducible example that demonstrates it.


#5

Hi Badger,

I think the issue might be assumptions around what the patterns actually match :-). I should have included the patterns (my bad).

The patterns are:
TCP_TIME %{MONTH} %{MONTHDAY} %{TIME}
HOST Host
TARGET ([A-Za-z0-9-.]+)
PACKET_RETURN_TIME \(packet return time:

Hopefully, this now makes sense.

Cheers
Paul


(Ponquersohn) #6

Hi,
There is definitely something wrong. Sample code:


input {
    generator { count => 1 lines => [ 'Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345', 'Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd' ] }
}

filter { 
    grok {
        break_on_match => true
        pattern_definitions => {
            "UID" => "[0-9a-f]{8}-[0-9a-f]{16}-[0-9a-f]{2}-[0-9a-f]{12}"
            "ACCEPT" => "ACCEPT"
            "ACCEPTIPPORT" => "%{IP:accept_ip}:%{NUMBER:accept_port}"
            "ACCEPTIPPORT_OBJ" => "%{IP:[accept][accept_ip]}:%{NUMBER:[accept][accept_port]}"
            "ACCEPT_OBJ_PTRN" => "%{NUMBER:[accept][date]}\|%{UID:uid}\|%{ACCEPT:action}\|%{ACCEPTIPPORT_OBJ:[accept][accept_ip_port]}"
            "EHLO" => "EHLO"
            "EHLO_OBJ_PTRN" => "%{NUMBER:[ehlo][date]}\|%{UID:uid}\|%{EHLO:action}\|%{GREEDYDATA:[ehlo][ehlo_host]}"


        }
        match => { "message" => "%{ACCEPT_OBJ_PTRN}"}
        #match => { "message" => "%{EHLO_OBJ_PTRN}"}
        #match => { "message" => "%{NUMBER:date}\|%{UID:uid}\|%{GREEDYDATA:_else}"}
    }

    if [_else] {
        # drop {}
        mutate { add_field => { "action" => "PARSE_ERROR" } }
    }
}


output {
	stdout { codec => rubydebug }
}

Gives:

{
      "@version" => "1",
        "action" => "ACCEPT",
      "sequence" => 0,
        "accept" => {
        "accept_ip_port" => "10.0.0.1:12345",
           "accept_port" => "12345",
                  "date" => "1534642187",
             "accept_ip" => "10.0.0.1"
    },
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:09:18.452Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345"
}
{
      "@version" => "1",
      "sequence" => 0,
    "@timestamp" => 2019-04-21T04:09:18.452Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd"
}

This time the _grokparsefailure is expected, because we don't have a matching pattern. But if we uncomment the second match, Logstash generates:

{
      "@version" => "1",
      "sequence" => 0,
    "@timestamp" => 2019-04-21T04:10:24.371Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345"
}
{
      "@version" => "1",
        "action" => "EHLO",
      "sequence" => 0,
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
    "@timestamp" => 2019-04-21T04:10:24.371Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd"
}

The first _grokparsefailure is not expected, because break_on_match is set.

If we uncomment the third match line, Logstash generates:

{
      "@version" => "1",
        "action" => "PARSE_ERROR",
      "sequence" => 0,
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
         "_else" => "EHLO|mx1.asd.dd",
    "@timestamp" => 2019-04-21T04:14:15.872Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
          "date" => "1534642187"
}
{
      "@version" => "1",
        "action" => "PARSE_ERROR",
      "sequence" => 0,
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
         "_else" => "ACCEPT|10.0.0.1:12345",
    "@timestamp" => 2019-04-21T04:14:15.872Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
          "date" => "1534642187"
}

According to the documentation, Logstash should break on the first match in this grok filter, but it goes further and matches only on the last.

This code used to work in Logstash 6.x. I didn't find anything interesting in the breaking changes, so maybe I'm missing something... or it's a bug indeed.

Cheers


#7

That is all working as expected. What you appear not to realize is that

grok {
    match => { "message" => "%{somePattern}"}
    match => { "message" => "%{anotherPattern}"}
    match => { "message" => "%{aThirdPattern}" }
}

is equivalent to

grok {
    match => { "message" => "%{aThirdPattern}" }
}

When there are multiple match options for the same field, only the last one is applied. The others are ignored.
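
A quick way to check this yourself is a sketch like the following, using the generator input; the capture names first_a/first_b are made up for illustration:

```
input { generator { count => 1 message => 'hello world' } }

filter {
    grok {
        # two match options for the same field
        match => { "message" => "%{WORD:first_a} %{GREEDYDATA:rest_a}" }
        match => { "message" => "%{WORD:first_b} %{GREEDYDATA:rest_b}" }
    }
}

output { stdout { codec => rubydebug } }
```

If only first_b/rest_b appear in the event, the first match option was never applied.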


(Ponquersohn) #8

But it used to work in 6.x:

# logstash -f /etc/logstash/conf.d/smg/test.logstash.conf --config.reload.automatic
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2019-04-21 00:37:51.168 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2019-04-21 00:37:51.187 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"6.7.1"}
[INFO ] 2019-04-21 00:38:01.380 [Converge PipelineAction::Create<main>] pipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[INFO ] 2019-04-21 00:38:01.750 [Converge PipelineAction::Create<main>] pipeline - Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x1e8dce21 sleep>"}
[INFO ] 2019-04-21 00:38:01.910 [Ruby-0-Thread-1: /usr/share/logstash/lib/bootstrap/environment.rb:6] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/awesome_print-1.7.0/lib/awesome_print/formatters/base_formatter.rb:31: warning: constant ::Fixnum is deprecated
{
        "action" => "EHLO",
      "@version" => "1",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:38:01.881Z,
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
          "host" => "feeder.acme.com",
      "sequence" => 0
}
{
        "accept" => {
                  "date" => "1534642187",
        "accept_ip_port" => "10.0.0.1:12345",
             "accept_ip" => "10.0.0.1",
           "accept_port" => "12345"
    },
        "action" => "ACCEPT",
      "@version" => "1",
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:38:01.858Z,
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
          "host" => "feeder.acme.com",
      "sequence" => 0
}
[INFO ] 2019-04-21 00:38:02.859 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
[INFO ] 2019-04-21 00:38:02.941 [[main]-pipeline-manager] pipeline - Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x1e8dce21 run>"}

#9

I am running 6.7.1 and it does not work in that.


(Ponquersohn) #10

My previous example was from 6.7.1 and the logs show it's working.

I couldn't find anything about:

When there are multiple match options for the same field only the last is implemented. The others are ignored.

in the documentation. Regardless of whether the previous example worked in 6.x or not, this:

input {
    generator { count => 1 lines => [ 'Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345', 'Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd' ] }
}

filter { 
    grok {
        break_on_match => true
        pattern_definitions => {
            "UID" => "[0-9a-f]{8}-[0-9a-f]{16}-[0-9a-f]{2}-[0-9a-f]{12}"
            "ACCEPT" => "ACCEPT"
            "ACCEPTIPPORT" => "%{IP:accept_ip}:%{NUMBER:accept_port}"
            "ACCEPTIPPORT_OBJ" => "%{IP:[accept][accept_ip]}:%{NUMBER:[accept][accept_port]}"
            "ACCEPT_OBJ_PTRN" => "%{NUMBER:[accept][date]}\|%{UID:uid}\|%{ACCEPT:action}\|%{ACCEPTIPPORT_OBJ:[accept][accept_ip_port]}"
            "EHLO" => "EHLO"
            "EHLO_OBJ_PTRN" => "%{NUMBER:[ehlo][date]}\|%{UID:uid}\|%{EHLO:action}\|%{GREEDYDATA:[ehlo][ehlo_host]}"


        }
        match => { "message" => ["%{ACCEPT_OBJ_PTRN}", "%{EHLO_OBJ_PTRN}"] }
        #match => { "message" => "%{NUMBER:date}\|%{UID:uid}\|%{GREEDYDATA:_else}"}
    }

    if [_else] {
        # drop {}
        mutate { add_field => { "action" => "PARSE_ERROR" } }
    }
}


output {
	stdout { codec => rubydebug }
}

Works in 6.x and in 7.0:

# logstash -f /etc/logstash/conf.d/smg/test.logstash.conf --config.reload.automatic
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2019-04-21 00:53:01.216 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2019-04-21 00:53:01.229 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"7.0.0"}
[INFO ] 2019-04-21 00:53:09.881 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>500, :thread=>"#<Thread:0x832a9c4 run>"}
[INFO ] 2019-04-21 00:53:10.000 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2019-04-21 00:53:10.146 [Ruby-0-Thread-1: /usr/share/logstash/lib/bootstrap/environment.rb:6] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2019-04-21 00:53:10.706 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/awesome_print-1.7.0/lib/awesome_print/formatters/base_formatter.rb:31: warning: constant ::Fixnum is deprecated
{
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
        "accept" => {
             "accept_ip" => "10.0.0.1",
           "accept_port" => "12345",
        "accept_ip_port" => "10.0.0.1:12345",
                  "date" => "1534642187"
    },
    "@timestamp" => 2019-04-21T04:53:10.115Z,
          "host" => "feeder.acme.com",
        "action" => "ACCEPT",
      "@version" => "1",
      "sequence" => 0
}
{
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
    "@timestamp" => 2019-04-21T04:53:10.171Z,
          "host" => "feeder.acme.com",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
        "action" => "EHLO",
      "@version" => "1",
      "sequence" => 0
}

It seems it's all about notation, but I liked the previous one better... it was somehow more readable.

Cheers,


#11

Hi,

I have to agree with @ponquersohn. We are seeing the behaviour of 'break_on_match' work as we would expect up until version 6.7.1. It is in version 7 that it stops working.

To me, the name 'break_on_match' implies breaking on the first match rather than the last. To avoid ambiguity, maybe the parameter should be renamed 'break_on_last_match', and then a 'break_on_first_match' could be implemented? Or the documentation could be explicit about the behaviour.

So, if it is in fact break-on-last-match, will reordering the patterns in the grok filter resolve the issue? I.e. put the more generic patterns first, working down to the more specific patterns.

Thanks
Paul


#12

I get it now. It's a pipeline.java_execution thing. That is on by default in 7; I've had it enabled since it was experimental, which is why 6.x behaves differently for me.


(Ponquersohn) #13

@paul_nzl
Reordering won't help, but changing the notation will.
This form continues until every pattern has been checked:

grok {
    match => { "message" => "%{somePattern}"}
    match => { "message" => "%{anotherPattern}"}
    match => { "message" => "%{aThirdPattern}" }
}

and this will break_on_match:

grok {
    match => { "message" => [ "%{somePattern}", "%{anotherPattern}", "%{aThirdPattern}"] }
}

I still believe it is a bug, but the workaround works for me.

Cheers,


#14

@Badger,

Thanks. I have just explicitly set 'pipeline.java_execution' to 'false', reverted the logstash.conf file to the grok filters as defined for v6.7, and it is now working as expected.
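
For anyone else wanting to try this, the setting goes in logstash.yml (a sketch; the file location varies by install):

```yaml
# logstash.yml: revert to the Ruby execution engine as a workaround
pipeline.java_execution: false
```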

So, it would appear to be a difference between the JRuby and Java execution engine implementations of grok?

@ponquersohn - thanks, I might give that a try as well. I guess it comes down to whether we are concerned about the performance of the JRuby vs Java execution engines.

Cheers
Paul