Hi,
I was wondering about regex performance in general, and in particular with Logstash and grok.
I'm currently trying to parse firewall logs sent through the syslog protocol, which mainly contain key-value pairs (for example, the key "id" with the value "2001").
This is what a log line looks like (redacted):
2020:10:31-23:43:52 firewall-state myservice[19509]: id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop" fwrule="60001" initf="eth1" srcmac="00:11:22:33:44:55" dstmac="00:11:22:33:44:55" srcip="192.168.1.1" dstip="192.168.2.1" proto="6" length="40" tos="0x00" prec="0x00" ttl="251" srcport="48234" dstport="16432" tcpflags="SYN"
So I've added a custom grok variable for the timestamp as the one provided by default by grok doesn't work:
CUSTOM_TIMESTAMP %{YEAR:year}:%{MONTHNUM2:month}:%{MONTHDAY:monthday}-%{HOUR:hour}:%{MINUTE:minute}:%{SECOND:second}
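To make clear what this timestamp format looks like outside of grok, here is a small Python sketch. It is only an approximation: the regex is a loose stand-in for the grok sub-patterns, and the names TIMESTAMP_RE and parse_timestamp are made up for illustration.

```python
import re
from datetime import datetime

# Loose stand-in for YEAR:MONTHNUM2:MONTHDAY-HOUR:MINUTE:SECOND (assumption:
# fixed-width digits, which is looser than the real grok sub-patterns).
TIMESTAMP_RE = re.compile(r"\d{4}:\d{2}:\d{2}-\d{2}:\d{2}:\d{2}")

def parse_timestamp(line):
    """Return a datetime for a leading 'YYYY:MM:DD-hh:mm:ss' stamp, else None."""
    m = TIMESTAMP_RE.match(line)
    if m is None:
        return None
    # strptime validates the value ranges that the loose regex does not check.
    return datetime.strptime(m.group(0), "%Y:%m:%d-%H:%M:%S")

print(parse_timestamp("2020:10:31-23:43:52 firewall-state myservice[19509]: ..."))
```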
Someone suggested that I not bother with the exact format of the values because, as long as I'm not forcing any backtracking (by using .* for instance, or something to that effect), it should be just fine to match any characters enclosed in the quotes.
So he proposed adding this custom variable:
STRING [^"]*
And then added this pattern:
"%{CUSTOM_TIMESTAMP:timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: id=\"%{STRING:service_id}\"(?: severity=\"%{STRING:service_severity}\")?(?: sys=\"%{STRING:service_sys}\")?(?: sub=\"%{STRING:service_sub}\")?(?: name=\"%{STRING:service_name}\")?(?: action=\"%{STRING:service_action}\")?(?: fwrule=\"%{STRING:service_fwrule}\")?(?: initf=\"%{STRING:service_initf}\")?(?: srcmac=\"%{STRING:service_srcmac}\")?(?: dstmac=\"%{STRING:service_dstmac}\")?(?: srcip=\"%{STRING:service_srcip}\")?(?: dstip=\"%{STRING:service_dstip}\")?(?: proto=\"%{STRING:service_proto}\")?(?: length=\"%{STRING:service_length}\")?(?: tos=\"%{STRING:service_tos}\")?(?: prec=\"%{STRING:service_prec}\")?(?: ttl=\"%{STRING:service_ttl}\")?(?: srcport=\"%{STRING:service_srcport}\")?(?: dstport=\"%{STRING:service_dstport}\")?(?: tcpflags=\"%{STRING:service_tcpflags}\")?"
So, as you can see, any quoted string (the values of the 'keys') is matched by the regex [^"]*
The pattern itself works (the optional groups marked with ? are needed because some key-value pairs sometimes don't show up), so I'm not worried about that. But my question is: what should I take into consideration performance-wise?
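The same idea (a mandatory id followed by optional groups whose values are matched by [^"]*) can be sketched in Python. This is an illustration only, not the grok pattern itself: the key list is shortened, and the names KEYS, PATTERN and parse are hypothetical.

```python
import re

STRING = r'[^"]*'  # same idea as the custom STRING variable

# Shortened, illustrative key list; the real pattern has many more keys.
KEYS = ["severity", "sys", "sub", "name", "action"]

# id is mandatory; every other key sits in an optional non-capturing group.
PATTERN = re.compile(
    r'id="(?P<id>' + STRING + r')"'
    + "".join(r'(?: {k}="(?P<{k}>{s})")?'.format(k=k, s=STRING) for k in KEYS)
)

def parse(payload):
    """Return a dict of captured values (None for absent keys), or None on no match."""
    m = PATTERN.match(payload)
    return m.groupdict() if m else None

full = 'id="2001" severity="info" sys="SecureNet" sub="packetfilter" name="Packet dropped" action="drop"'
partial = 'id="2001" sys="SecureNet" action="drop"'
print(parse(full))
print(parse(partial))
```

Note that optional groups chained like this only capture keys that appear in the listed order; a key that shows up out of order is silently left unmatched.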
Initially I started writing the regex by being more exact and making use (when that was available) of the already existing grok patterns (mac addresses, ipv6, whatever).
Is it worth doing that and being very exact in this context? Does it bring any value/performance with it? Or is it negligible?
From what I've generally read, I gather that the more exact the regex, the better the performance. But oftentimes the examples were too obvious, such as using [0-9]+ instead of .*. I'm not sure if this also applies to more, let's say, subtle situations.
Is there anywhere I could test this under 'lab' conditions, as it were?
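One rough way to get a feel for this outside Logstash is a micro-benchmark like the Python sketch below. It rests on several assumptions: Python's re module is a backtracking engine, but it is not the engine grok uses, so the numbers are only indicative. The "exact" sub-patterns are hand-written approximations rather than the stock grok patterns, and the names CANDIDATES and bench are invented for this example. It times both a matching and a non-matching line, since failed matches are often where the differences show up.

```python
import re
import timeit

LINE = 'srcmac="00:11:22:33:44:55" srcip="192.168.1.1" srcport="48234"'
# Same line with a different last key, so every candidate fails to match.
NO_MATCH = 'srcmac="00:11:22:33:44:55" srcip="192.168.1.1" dstport="48234"'

# Three ways to write the same extraction (illustrative approximations).
CANDIDATES = {
    "negated class": r'srcmac="([^"]*)" srcip="([^"]*)" srcport="([^"]*)"',
    "exact": (r'srcmac="((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})" '
              r'srcip="((?:\d{1,3}\.){3}\d{1,3})" srcport="(\d+)"'),
    "greedy dot": r'srcmac="(.*)" srcip="(.*)" srcport="(.*)"',
}

def bench(line, number=20000):
    """Return {candidate name: seconds spent on `number` match attempts}."""
    timings = {}
    for name, pattern in CANDIDATES.items():
        rx = re.compile(pattern)
        timings[name] = timeit.timeit(lambda: rx.match(line), number=number)
    return timings

for label, line in (("match", LINE), ("no match", NO_MATCH)):
    for name, seconds in bench(line).items():
        print(f"{label:9} {name:14} {seconds:.4f}s")
```

For results that actually reflect Logstash, the same comparison would have to be run inside a Logstash pipeline with real log volume.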
I'm looking forward to your answer!