Grok: the correct syntax to match against multiple patterns

I have an event whose message field I want to match against multiple patterns. If the message matches any one of the patterns, grok should stop trying the remaining patterns in the list/array and write to output.

For reference, this is all I could find on the site.

It says

If you need to match multiple patterns against a single field, the value can be an array of patterns:

The issue here is that both of the following filters work. By "work" I mean that Logstash doesn't error out, so both are syntactically valid.

What exactly is the difference between the following two filters?

FILTER 1

filter {
    if [fields][component][0] in ["data_vm"]  {
        grok {
            patterns_dir => "/usr/share/logstash/patterns"
            match => ["message", "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{DATA:tr}%{SPACE}-%{SPACE}", "(?m)^%{TIMESTAMP_ISO8601:date}%{DATA:tr}%{SPACE}:%{SPACE}", "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{DATA:tr}%{DATA:logger_class}:%{SPACE}"]
            add_field => {"dhiwakar_new" => "%{message_r}"}
        }
    }
}

FILTER 2

filter {
    if [fields][component][0] in ["data_vm"]  {
        grok {
            patterns_dir => "/usr/share/logstash/patterns"
            match => { "message" => ["(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}%{DATA:tr}%{SPACE}-%{SPACE}", "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{DATA:tr}%{SPACE}:%{SPACE}", "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{DATA:tr}%{SPACE}-%{SPACE}%{DATA:logger_class}:%{SPACE}"]}
            add_field => {"dhiwakar_new" => "%{message_r}"}
        }
    }
}

Good question! The match option on a grok filter is documented as taking a hash. But, as you have noticed, the configuration compiler will not complain if you give it an array instead. Generally, the compiler is not fussy about whether it gets a hash where an array is needed or an array where a hash is needed. It will just convert between the two.

The examples in the documentation of plugins routinely show the wrong format being used for options. For example, the default_keys option of a kv filter and the xpath option of an xml filter both show an array where a hash is needed.

So, how does it convert an array into a hash? It shifts the array members off the front of the array pairwise.

For example, if we have this configuration

input { generator { count => 1 lines => [ '' ] } }
output { stdout { } }
filter {
    mutate { add_field => [ "field1", "foo bar", "field2", "bing bang" ] }
}

we will get

    "field1" => "foo bar",
    "field2" => "bing bang"

If the array has an odd number of entries, you will get the configuration error "This field must contain an even number of items".
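
For comparison, here is a small sketch of the same mutate written in the hash form, which is what the pairwise conversion of the array above should amount to. Note that entries in a hash are separated by whitespace, not commas.

filter {
    # Hash form: each key => value pair corresponds to one pair of array members
    mutate { add_field => { "field1" => "foo bar" "field2" => "bing bang" } }
}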

Now we can add a grok filter to match those two fields

    grok {
        break_on_match => false
        match => [ 
            "field1", "^foo %{WORD:result1}", 
            "field2", "^bing %{WORD:result2}"
        ]
    }

and we will get

   "result2" => "bang",
   "result1" => "bar",

So your first filter definition will be parsed as if it were

match => {
    "message" => "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}..."
    "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}..." => "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}...%{GREEDYDATA:message_r}"
}

Of course, event.get("(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}...") will return nil, because no field has that name, so the second entry in the match hash is a no-op. In effect, only the first pattern is ever tried against message.

Note that grok will try each of the patterns until it gets a match. That means the patterns must be ordered from most specific to least specific. Your pattern

"(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}%{DATA:thread}%{SPACE}-%{SPACE}%{GREEDYDATA:message_r}"

is less specific than

"(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}%{DATA:thread}%{SPACE}-%{SPACE}%{DATA:logger_class}:%{SPACE}%{GREEDYDATA:message_r}"

so the latter pattern never gets used. Make it the first pattern in the array.
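
As a rough sketch of that ordering, using only the two patterns quoted above in the hash form of match, and assuming the default break_on_match => true so that grok stops at the first pattern that matches (the field names such as thread are the ones in the quoted patterns; any remaining patterns would follow, again from most to least specific):

filter {
    if [fields][component][0] in ["data_vm"] {
        grok {
            patterns_dir => "/usr/share/logstash/patterns"
            # Patterns are tried in order: the one that also captures
            # logger_class comes first, the less specific one is the fallback.
            match => {
                "message" => [
                    "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}%{DATA:thread}%{SPACE}-%{SPACE}%{DATA:logger_class}:%{SPACE}%{GREEDYDATA:message_r}",
                    "(?m)^%{TIMESTAMP_ISO8601:date}%{SPACE}%{LOGLEVEL:loglevel}%{SPACE}%{DATA:thread}%{SPACE}-%{SPACE}%{GREEDYDATA:message_r}"
                ]
            }
        }
    }
}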