Regex performance with Logstash when using a more general match

Hi,

I was wondering about regex performance in general and in particular with logstash and grok.
I'm currently trying to parse firewall logs sent through the syslog protocol, which mainly contain keys and values (a key would be e.g. "id" and its value "2001").
This is what a log line looks like (redacted):

2020:10:31-23:43:52 firewall-state myservice[19509]: id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop" fwrule="60001" initf="eth1" srcmac="00:11:22:33:44:55" dstmac="00:11:22:33:44:55" srcip="192.168.1.1" dstip="192.168.2.1" proto="6" length="40" tos="0x00" prec="0x00" ttl="251" srcport="48234" dstport="16432" tcpflags="SYN"

So I've added a custom grok pattern for the timestamp, as the one grok provides by default doesn't match it:

CUSTOM_TIMESTAMP %{YEAR:year}:%{MONTHNUM2:month}:%{MONTHDAY:monthday}-%{HOUR:hour}:%{MINUTE:minute}:%{SECOND:second}

Someone suggested that I shouldn't bother with the exact format of the values because, as long as I'm not forcing any backtracking (by using .*, for instance) or something to that effect, it should be just fine to match any characters that appear between the quotes.
So he proposed adding this custom pattern:

STRING [^"]*

And then added this pattern:

"%{CUSTOM_TIMESTAMP:timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: id=\"%{STRING:service_id}\"(?: severity=\"%{STRING:service_severity}\")?(?: sys=\"%{STRING:service_sys}\")?(?: sub=\"%{STRING:service_sub}\")?(?: name=\"%{STRING:service_name}\")?(?: action=\"%{STRING:service_action}\")?(?: fwrule=\"%{STRING:service_fwrule}\")?(?: initf=\"%{STRING:service_initf}\")?(?: srcmac=\"%{STRING:service_srcmac}\")?(?: dstmac=\"%{STRING:service_dstmac}\")?(?: srcip=\"%{STRING:service_srcip}\")?(?: dstip=\"%{STRING:service_dstip}\")?(?: proto=\"%{STRING:service_proto}\")?(?: length=\"%{STRING:service_length}\")?(?: tos=\"%{STRING:service_tos}\")?(?: prec=\"%{STRING:service_prec}\")?(?: ttl=\"%{STRING:service_ttl}\")?(?: srcport=\"%{STRING:service_srcport}\")?(?: dstport=\"%{STRING:service_dstport}\")?(?: tcpflags=\"%{STRING:service_tcpflags}\")?"

So, as you can see, any quoted string (the values of the keys) is matched by the regex [^"]*

The pattern itself works (the optional groups added with ? are needed because some key-value pairs don't always show up), so I'm not worried about that. But my question is: what should I take into consideration performance-wise?
Initially I started writing the regex by being more exact and making use, where available, of the existing grok patterns (MAC addresses, IPv6 and so on).

Is it worth doing that and being very exact in this context? Does it bring any value/performance with it? Or is it negligible?
From what I've generally read, I gather that the more exact the regex, the better the performance. But oftentimes the examples were too obvious, such as using [0-9]+ instead of .*. I'm not sure whether this also applies to more, let's say, subtle situations.

Is there anywhere I could test this under 'lab' conditions, as it were?

I'm looking forward to your answer!

Please forgive me for ignoring your actual question, but I have to say I would approach a log entry like that using dissect to take off the timestamp, hostname, service name and pid, then use a kv filter to parse the rest. Something like this.
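For the sample line above, a minimal sketch of that combination could look something like this (the field names ts and kv_pairs are just placeholders I picked, and as far as I know the kv defaults handle the quoted values):

filter {
    # Peel off the timestamp, host, program and pid; the rest of the line
    # goes into a temporary field for the kv filter to work on.
    dissect {
        mapping => {
            "message" => '%{ts} %{syslog_hostname} %{syslog_program}[%{syslog_pid}]: %{kv_pairs}'
        }
    }
    # Parse the remaining key="value" pairs (quoted values with spaces included)
    # into individual fields, then drop the temporary field.
    kv {
        source => "kv_pairs"
        remove_field => [ "kv_pairs" ]
    }
}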

With all those optional fields the problem is not messages that match, it is messages that do not match. Every optional field will result in back-tracking over other optional fields. When you get a message that does not match then it will be really expensive.

If you want to measure the cost then a simple approach would be just to measure the CPU time burnt by logstash when you run it to read a file large enough to cause the CPU time to be dominated by processing rather than startup (and startup can be a minute or more).
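A rough harness for that, assuming a pipeline file called bench.conf and a large sample file called sample.log (both names are just placeholders), could be:

# bench.conf
input { stdin { } }
filter {
    # drop in the filter configuration you want to time here
}
output { stdout { codec => dots } }

Then run something like time bin/logstash -f bench.conf < sample.log once per candidate configuration and compare the user/system CPU figures, making sure sample.log is big enough that processing dominates the startup time.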

It might require more work, but if you configure monitoring then you can see the milliseconds used by each filter. You can connect logstash to kibana using metricbeat to visualize that data for easier consumption.
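Even without metricbeat, the stats API that logstash serves on port 9600 by default exposes those numbers, so a quick way to peek at them is something like:

curl -s 'localhost:9600/_node/stats/pipelines?pretty'

Each filter shows up there with its id and an events.duration_in_millis counter.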

For example, feed several thousand lines through logstash with a configuration like

filter {
    # Randomly assign each event to one of three branches to compare
    ruby { code => 'event.set("[@metadata][random]", rand(1..3))' }
    if [@metadata][random] == 1 {
        grok { ... }
    } else if [@metadata][random] == 2 {
        dissect { ... }
        kv { ... }
    } else {
        # Some other choice of filters.
    }
}

Obviously it does not have to be 3 choices. It could be just 2, to narrow down the options, or several differently tuned filter sets.

If you want to compare two different configurations of the same filter (e.g. grok) then you will definitely want to use the id option on one or more to help you see which is which.
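For example (the id values are arbitrary labels I made up):

grok { id => "grok_exact" ... }
grok { id => "grok_general" ... }

Those ids are what the per-filter timings are reported under in the stats and monitoring data.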


Thanks for the relevant answer. Of course it helps. I had no idea about dissect or kv.

I'm not sure, though, how I would solve it further on if I wanted to name the keys differently. This is just the naming of one firewall, but if I wanted to properly aggregate logs when searching through them, I would want to have the same naming structure across different firewall models (some might call it ipsrc, others source or whatever).
Would you recommend transforming the fields afterwards, by adding yet another filter, for instance?

Your remark regarding the performance is on the mark, so that's at least partially what I wanted to understand.
Just so I get that straight, then: should I be bothered by [^"]* at all and try to make it more accurate?

Firewall log lines usually keep the same structure for a specific device model or vendor.

Since the structure of the log line does not change, there is no reason to use grok to parse the message. I would suggest the same thing that @Badger suggested: use a combination of dissect and kv.

Using dissect instead of grok will make your pipeline use less CPU and depending on the number of events per second, it could use a lot less CPU.

While grok can solve your problem and parse your messages, the performance can be better using other filters that do not rely on regular expressions.

Regarding the names of the keys, you can normalize your fields using mutate filters to rename them. I would also suggest that you use the Elastic Common Schema when naming your fields.

This way you would rename both ipsrc and source to source.address.
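A sketch of that normalization for the fields in the sample line, using mutate to rename them to ECS names (what exists on the left-hand side depends on what the kv filter extracted):

filter {
    mutate {
        rename => {
            "srcip"   => "[source][ip]"
            "srcport" => "[source][port]"
            "dstip"   => "[destination][ip]"
            "dstport" => "[destination][port]"
        }
    }
}

A firewall that emits ipsrc instead of srcip would simply get its own rename entry pointing at the same target field.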


Thank you both for your answer, this seems to be the most reasonable solution at this point.
I have another question though, regarding logstash in general and grok in this respect. I just want to know nonetheless, because I'm bound to use grok at some point anyway, and I'd like to understand how it works.
If I have two match statements one after another, how does that actually work? Just in this order? First match, first parsed? So I'd generally have to make sure that I place the lines which are more likely to get matched first?

The match option is an array, and arrays are ordered. grok will attempt to match in the order you give them, so the most specific should come first.

If you set the break_on_match option to false you can make it try to match all of them.


But what's the point of trying the remaining patterns if there is already a match for the first one?
I'm not sure I follow the logic of break_on_match (I've just read the documentation).

Suppose you have a message like

some text UserId:foo other indeterminate text Result:failed

and you want to extract the user and result. You could use

match => {
    "message" => [
        "UserId:%{WORD:user}",
        "Result:%{WORD:result}"
    ]
}
break_on_match => false

which would extract both. If one were missing it would still extract the other.

You could achieve the same using a single pattern with a %{DATA} pattern between those two, and surround them each with ( )? to make them optional, but that is harder to understand and costs more (DATA and especially GREEDYDATA are expensive when the pattern does not match).

It is not an option I use much, but it does come in handy sometimes.

Would it make sense, then, to actually use two different match directives?
such as:

match => { "message" => %{whatever....}
match => {"message" => %{someotherthing}

Is that proper syntax? Or should everything be placed into the one array?

I never use multiple occurrences of an option. logstash will combine them, and it almost always does it in the way you would expect. And then, one day, it will bite you, by doing something unexpected.
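For reference, putting everything into the one array looks like this (the two pattern strings are just placeholders):

grok {
    match => {
        "message" => [
            "first pattern",
            "second pattern"
        ]
    }
}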

