Regex performance with logstash using more general match

Please forgive me for ignoring your actual question, but I have to say I would approach a log entry like that using dissect to take off the timestamp, hostname, service name and pid, then use a kv filter to parse the rest. Something like this.
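A rough sketch of what I mean, assuming the line starts with a timestamp, hostname, and service[pid] prefix followed by key=value pairs (the field names here are just illustrative):

filter {
    dissect {
        # Peel off the fixed prefix and leave the key=value part in [rest]
        mapping => { "message" => "%{timestamp} %{hostname} %{service}[%{pid}]: %{rest}" }
    }
    kv {
        # Parse the remaining key=value pairs into their own fields
        source => "rest"
    }
}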

With all those optional fields the problem is not the messages that match, it is the messages that do not. Every optional field results in back-tracking over the other optional fields, so when you get a message that does not match, rejecting it can be really expensive.
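To see why, consider a hypothetical pattern with several optional trailing fields (the fields are made up, but the shape is what matters):

filter {
    grok {
        # Each optional group multiplies the combinations the regex engine
        # has to try before it can declare that a line does not match.
        match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{HOSTNAME:host}( user=%{WORD:user})?( session=%{NUMBER:session})?( status=%{WORD:status})?" }
    }
}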

If you want to measure the cost then a simple approach is to measure the CPU time burnt by logstash when it reads a file large enough that the CPU time is dominated by processing rather than startup (and startup can take a minute or more).
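A minimal test harness would be something like this, assuming the sample lines are in /tmp/sample.log (the path is hypothetical); run it with your candidate filters and compare the CPU time the OS reports for the logstash process:

input {
    file {
        # Read the sample file once from the beginning, then shut down
        path => "/tmp/sample.log"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        mode => "read"
        exit_after_read => true
    }
}
output {
    # Discard events cheaply; we only care about the cost of the filters
    stdout { codec => dots }
}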

It requires a bit more work, but if you configure monitoring then you can see the milliseconds used by each filter. You can connect logstash to kibana using metricbeat to visualize that data for easier consumption.

For example, feed several thousand lines through logstash with a configuration like

filter {
    # Randomly send each event down one of the branches so the competing
    # filter sets are compared on the same input stream.
    ruby { code => 'event.set("[@metadata][random]", rand(1..3))' }
    if [@metadata][random] == 1 {
        grok { ... }
    } else if [@metadata][random] == 2 {
        dissect { ... }
        kv { ... }
    } else {
        # Some other choice of filters.
    }
}

Obviously it does not have to be three choices. It could be just two branches to compare a pair of approaches, or several differently tuned filter sets.

If you want to compare two different configurations of the same filter (e.g. grok) then you will definitely want to set the id option on one or both of them so that you can tell which is which in the monitoring data.
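For example (the ids are whatever labels help you tell the variants apart, and the patterns are placeholders):

filter {
    if [@metadata][random] == 1 {
        grok {
            id => "grok_original"
            match => { "message" => "..." }
        }
    } else {
        grok {
            id => "grok_tuned"
            match => { "message" => "..." }
        }
    }
}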
