Logstash 7 GROK Filter plugin 'break_on_match' appears to be broken


#1

Hi,

We have just installed v7 of Logstash and Elasticsearch. Our configuration worked fine in 6.7 of both. What we have noticed is that under 7.0 the 'break_on_match' option no longer appears to be honoured.

An example grok filter that used to work is:

grok {
  named_captures_only => "true"
  patterns_dir => [ "/logstash/patterns" ]
  # will match where the target is reachable
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{PACKET_RETURN_TIME} %{GREEDYDATA:packetTime} ms" }
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
}

Under 7.0, getting the same behaviour requires changing it to:

grok {
  named_captures_only => "true"
  patterns_dir => [ "/logstash/patterns" ]
  # will match where the target is reachable
  match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{NUMBER:packetTime} ms" }
}
# HACK: due to what appears to be a bug in the LS 7.0 grok filter, check whether we have
# extracted a log time. If we have, the pattern above found a match; if not, attempt to
# match the pattern below.
if [logTime] !~ /.+/ {
  grok {
    named_captures_only => "true"
    patterns_dir => [ "/opt/logstash/logstash-7.0.0/patterns" ]
    # will match where the target is unreachable
    match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
  }
}
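
As an aside, the presence check can also be written with a bare field conditional, which is truthy only when the field exists. A sketch, equivalent here as long as logTime is either absent or non-empty:

```
# try the fallback pattern only when the first grok did not set logTime
if ![logTime] {
  grok {
    named_captures_only => "true"
    patterns_dir => [ "/logstash/patterns" ]
    match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
  }
}
```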

Without the above, the _grokparsefailure tag is appended to the document in ES.

The documentation for the grok filter in 7.0 still says that 'break_on_match' is valid, so this would appear to be a bug.

Anybody else come across this?

Cheers


#2

If you have two match options for the same field, my expectation in Logstash 6 is that the first pattern would be ignored and only the second pattern would be evaluated.


#3

Hi Badger,

Example 'good' and 'bad' log file entries that are processed by the above configuration:
Apr 12 00:01:31 [INFO] Host server1 [10.228.155.28] is reachable via tcp on port 4 (packet return time: 0.332 ms)
Apr 12 00:02:22 [INFO] Host server1 [10.228.155.28] is unreachable

Logstash 6
The first log line matches the first pattern, and with break_on_match the grok filter stops with the following fields populated:
logTime=Apr 12 00:01:31
target=server1
state=reachable
packetTime=0.332

The second log line doesn't match the first pattern, but does match the second pattern, which results in:
logTime=Apr 12 00:02:22
target=server1
state=unreachable

Logstash 7
This results in a _grokparsefailure. Modifying the config to put the if statement checking for the presence of logTime before attempting to match the second pattern appears to work around the parse issue.

So, something has changed between the versions. I have also explicitly added an end-of-line anchor '$' to the second pattern so it will not match a 'good' log line, to no avail.

We also have another section that parses a different log file and relies on the 'break_on_match' behaviour; it also no longer works.

Cheers
Paul


#4

I have to disagree. With this configuration in 6.7

input { generator { count => 1 message => 'Apr 12 00:01:31 [INFO] Host server1 [10.228.155.28] is reachable via tcp on port 4 (packet return time: 0.332 ms)' } }
input { generator { count => 1 message => 'Apr 12 00:02:22 [INFO] Host server1 [10.228.155.28] is unreachable ' } }

filter {
    grok {
        named_captures_only => "true"
        pattern_definitions => { "TCP_TIME" => "([A-Z][a-z]{2} [0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2})" "HOST" => "%{HOSTNAME}" "TARGET" => "\[%{IP}\]" }
        # will match where the target is reachable
        match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} %{GREEDYDATA} is %{WORD:state} %{GREEDYDATA} %{PACKET_RETURN_TIME} %{GREEDYDATA:packetTime} ms" }
        match => { "message" => "%{TCP_TIME:logTime} %{GREEDYDATA} %{HOST} %{TARGET:target} is %{WORD:state}" }
    }
}

both messages are matched only against the second match option. In fact it doesn't complain that PACKET_RETURN_TIME (or XXPACKET_RETURN_TIME if you change it) is undefined, because grok never even compiles the first match option.

I totally agree that the application of _grokparsefailure is broken, but you are not providing a reproducible example that demonstrates it.


#5

Hi Badger,

I think the issue might be assumptions around what the patterns actually match :-). I should have included the patterns (my bad).

The patterns are:
TCP_TIME %{MONTH} %{MONTHDAY} %{TIME}
HOST Host
TARGET ([A-Za-z0-9-.]+)
PACKET_RETURN_TIME \(packet return time:

Hopefully, this now makes sense.

Cheers
Paul


(Ponquersohn) #6

Hi,
There is definitely something wrong. Sample code:


input {
    generator { count => 1 lines => [ 'Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345', 'Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd' ] }
}

filter { 
    grok {
        break_on_match => true
        pattern_definitions => {
            "UID" => "[0-9a-f]{8}-[0-9a-f]{16}-[0-9a-f]{2}-[0-9a-f]{12}"
            "ACCEPT" => "ACCEPT"
            "ACCEPTIPPORT" => "%{IP:accept_ip}:%{NUMBER:accept_port}"
            "ACCEPTIPPORT_OBJ" => "%{IP:[accept][accept_ip]}:%{NUMBER:[accept][accept_port]}"
            "ACCEPT_OBJ_PTRN" => "%{NUMBER:[accept][date]}\|%{UID:uid}\|%{ACCEPT:action}\|%{ACCEPTIPPORT_OBJ:[accept][accept_ip_port]}"
            "EHLO" => "EHLO"
            "EHLO_OBJ_PTRN" => "%{NUMBER:[ehlo][date]}\|%{UID:uid}\|%{EHLO:action}\|%{GREEDYDATA:[ehlo][ehlo_host]}"


        }
        match => { "message" => "%{ACCEPT_OBJ_PTRN}"}
        #match => { "message" => "%{EHLO_OBJ_PTRN}"}
        #match => { "message" => "%{NUMBER:date}\|%{UID:uid}\|%{GREEDYDATA:_else}"}
    }

    if [_else] {
        # drop {}
        mutate { add_field => { "action" => "PARSE_ERROR" } }
    }
}


output {
	stdout { codec => rubydebug }
}

Gives:

{
      "@version" => "1",
        "action" => "ACCEPT",
      "sequence" => 0,
        "accept" => {
        "accept_ip_port" => "10.0.0.1:12345",
           "accept_port" => "12345",
                  "date" => "1534642187",
             "accept_ip" => "10.0.0.1"
    },
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:09:18.452Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345"
}
{
      "@version" => "1",
      "sequence" => 0,
    "@timestamp" => 2019-04-21T04:09:18.452Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd"
}

This time the _grokparsefailure is expected, because we don't have a matching pattern. But if we uncomment the second match, Logstash generates:

{
      "@version" => "1",
      "sequence" => 0,
    "@timestamp" => 2019-04-21T04:10:24.371Z,
          "tags" => [
        [0] "_grokparsefailure"
    ],
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345"
}
{
      "@version" => "1",
        "action" => "EHLO",
      "sequence" => 0,
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
    "@timestamp" => 2019-04-21T04:10:24.371Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd"
}

The first _grokparsefailure is not expected, because break_on_match is set.

If we uncomment the third match line, Logstash generates:

{
      "@version" => "1",
        "action" => "PARSE_ERROR",
      "sequence" => 0,
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
         "_else" => "EHLO|mx1.asd.dd",
    "@timestamp" => 2019-04-21T04:14:15.872Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
          "date" => "1534642187"
}
{
      "@version" => "1",
        "action" => "PARSE_ERROR",
      "sequence" => 0,
        "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
         "_else" => "ACCEPT|10.0.0.1:12345",
    "@timestamp" => 2019-04-21T04:14:15.872Z,
          "host" => "feeder.acme.com",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
          "date" => "1534642187"
}

According to the documentation, Logstash should break on the first match in this grok filter, but it goes further and matches only on the last.

This code used to work in Logstash 6.x. I didn't find anything interesting in the breaking changes, so maybe I'm missing something... or it's a bug indeed.

Cheers


#7

That is all working as expected. What you appear not to realize is that

grok {
    match => { "message" => "%{somePattern}"}
    match => { "message" => "%{anotherPattern}"}
    match => { "message" => "%{aThirdPattern}" }
}

is equivalent to

grok {
    match => { "message" => "%{aThirdPattern}" }
}

When there are multiple match options for the same field, only the last one is applied. The others are ignored.
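
A quick way to check this yourself is a sketch like the following, using the generator input; the capture names first_a/first_b are made up for illustration:

```
input { generator { count => 1 message => 'hello world' } }

filter {
    grok {
        # two match options for the same field
        match => { "message" => "%{WORD:first_a} %{GREEDYDATA:rest_a}" }
        match => { "message" => "%{WORD:first_b} %{GREEDYDATA:rest_b}" }
    }
}

output { stdout { codec => rubydebug } }
```

If only first_b/rest_b appear in the event, the first match option was never applied.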


(Ponquersohn) #8

But it used to work in 6.x:

# logstash -f /etc/logstash/conf.d/smg/test.logstash.conf --config.reload.automatic
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2019-04-21 00:37:51.168 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2019-04-21 00:37:51.187 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"6.7.1"}
[INFO ] 2019-04-21 00:38:01.380 [Converge PipelineAction::Create<main>] pipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[INFO ] 2019-04-21 00:38:01.750 [Converge PipelineAction::Create<main>] pipeline - Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x1e8dce21 sleep>"}
[INFO ] 2019-04-21 00:38:01.910 [Ruby-0-Thread-1: /usr/share/logstash/lib/bootstrap/environment.rb:6] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/awesome_print-1.7.0/lib/awesome_print/formatters/base_formatter.rb:31: warning: constant ::Fixnum is deprecated
{
        "action" => "EHLO",
      "@version" => "1",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:38:01.881Z,
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
          "host" => "feeder.acme.com",
      "sequence" => 0
}
{
        "accept" => {
                  "date" => "1534642187",
        "accept_ip_port" => "10.0.0.1:12345",
             "accept_ip" => "10.0.0.1",
           "accept_port" => "12345"
    },
        "action" => "ACCEPT",
      "@version" => "1",
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
    "@timestamp" => 2019-04-21T04:38:01.858Z,
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
          "host" => "feeder.acme.com",
      "sequence" => 0
}
[INFO ] 2019-04-21 00:38:02.859 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
[INFO ] 2019-04-21 00:38:02.941 [[main]-pipeline-manager] pipeline - Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x1e8dce21 run>"}

#9

I am running 6.7.1 and it does not work in that.


(Ponquersohn) #10

My previous example was from 6.7.1 and the logs show it's working.

I couldn't find anything about:

When there are multiple match options for the same field only the last is implemented. The others are ignored.

in the documentation. Regardless of whether the previous example worked in 6.x or not, this:

input {
    generator { count => 1 lines => [ 'Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345', 'Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd' ] }
}

filter { 
    grok {
        break_on_match => true
        pattern_definitions => {
            "UID" => "[0-9a-f]{8}-[0-9a-f]{16}-[0-9a-f]{2}-[0-9a-f]{12}"
            "ACCEPT" => "ACCEPT"
            "ACCEPTIPPORT" => "%{IP:accept_ip}:%{NUMBER:accept_port}"
            "ACCEPTIPPORT_OBJ" => "%{IP:[accept][accept_ip]}:%{NUMBER:[accept][accept_port]}"
            "ACCEPT_OBJ_PTRN" => "%{NUMBER:[accept][date]}\|%{UID:uid}\|%{ACCEPT:action}\|%{ACCEPTIPPORT_OBJ:[accept][accept_ip_port]}"
            "EHLO" => "EHLO"
            "EHLO_OBJ_PTRN" => "%{NUMBER:[ehlo][date]}\|%{UID:uid}\|%{EHLO:action}\|%{GREEDYDATA:[ehlo][ehlo_host]}"


        }
        match => { "message" => ["%{ACCEPT_OBJ_PTRN}", "%{EHLO_OBJ_PTRN}"] }
        #match => { "message" => "%{NUMBER:date}\|%{UID:uid}\|%{GREEDYDATA:_else}"}
    }

    if [_else] {
        # drop {}
        mutate { add_field => { "action" => "PARSE_ERROR" } }
    }
}


output {
	stdout { codec => rubydebug }
}

Works in 6.x and in 7.0:

# logstash -f /etc/logstash/conf.d/smg/test.logstash.conf --config.reload.automatic
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[WARN ] 2019-04-21 00:53:01.216 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2019-04-21 00:53:01.229 [LogStash::Runner] runner - Starting Logstash {"logstash.version"=>"7.0.0"}
[INFO ] 2019-04-21 00:53:09.881 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>500, :thread=>"#<Thread:0x832a9c4 run>"}
[INFO ] 2019-04-21 00:53:10.000 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2019-04-21 00:53:10.146 [Ruby-0-Thread-1: /usr/share/logstash/lib/bootstrap/environment.rb:6] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2019-04-21 00:53:10.706 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/awesome_print-1.7.0/lib/awesome_print/formatters/base_formatter.rb:31: warning: constant ::Fixnum is deprecated
{
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
       "message" => "Aug 18 01:02:03 asd ecelerity[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|ACCEPT|10.0.0.1:12345",
        "accept" => {
             "accept_ip" => "10.0.0.1",
           "accept_port" => "12345",
        "accept_ip_port" => "10.0.0.1:12345",
                  "date" => "1534642187"
    },
    "@timestamp" => 2019-04-21T04:53:10.115Z,
          "host" => "feeder.acme.com",
        "action" => "ACCEPT",
      "@version" => "1",
      "sequence" => 0
}
{
           "uid" => "0a8018c9-ab9ff700000058d7-22-5b78c80bd67f",
       "message" => "Aug 18 01:02:03 asd bmserver[12345]: 1534642187|0a8018c9-ab9ff700000058d7-22-5b78c80bd67f|EHLO|mx1.asd.dd",
    "@timestamp" => 2019-04-21T04:53:10.171Z,
          "host" => "feeder.acme.com",
          "ehlo" => {
        "ehlo_host" => "mx1.asd.dd",
             "date" => "1534642187"
    },
        "action" => "EHLO",
      "@version" => "1",
      "sequence" => 0
}

It seems it's all about notation, but I liked the previous one better... it was somehow more readable.

Cheers,


#11

Hi,

I have to agree with @ponquersohn. We are seeing the behaviour of 'break_on_match' work as we would expect up until version 6.7.1. It is in version 7 that it stops working.

To me, the name 'break_on_match' implies breaking on the first match rather than the last. To avoid ambiguity, maybe the parameter should be renamed 'break_on_last_match', and then a 'break_on_first_match' could be implemented? Or the documentation could be explicit about the behaviour.

So, if it is in fact break-on-last-match, will reordering the patterns in the grok filter resolve the issue? I.e. put the more generic patterns first, working down to the more specific patterns.

Thanks
Paul


#12

I get it now. It's a pipeline.java_execution thing. That is on by default in 7; I've had it enabled since it was experimental, which is why 6.x behaves differently for me.


(Ponquersohn) #13

@paul_nzl
Reordering won't help, but changing the notation will.
This form continues until every pattern has been checked:

grok {
    match => { "message" => "%{somePattern}"}
    match => { "message" => "%{anotherPattern}"}
    match => { "message" => "%{aThirdPattern}" }
}

and this will break_on_match:

grok {
    match => { "message" => [ "%{somePattern}", "%{anotherPattern}", "%{aThirdPattern}"] }
}

I still believe it is a bug, but the workaround works for me.

Cheers,


#14

@Badger,

Thanks. I have just explicitly set 'pipeline.java_execution' to 'false', reverted the logstash.conf file to the grok filters as defined for v6.7, and it is now working as expected.
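
For anyone else wanting to try this, the setting goes in logstash.yml (a sketch; the file location varies by install):

```yaml
# logstash.yml: revert to the Ruby execution engine as a workaround
pipeline.java_execution: false
```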

So, it would appear to be a difference between the JRuby and Java execution engine implementations of grok?

@ponquersohn - thanks, I might give that a try as well. I guess it comes down to whether we are concerned about the performance of the JRuby vs Java execution engines.

Cheers
Paul