Grok increase CPU usage to 700+%

Hello team,

we have implemented a following GROK multi pattern match but when we deploy the pipeline for this component it raises CPU from 100 to 700%.

Is there a way how to improve its performance?
One way would be to add more logstash nodes (we have 2).

grok {
            match => {
                "[message]" => [
                    "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] \"%{WORD:[http][method]} %{URIPATH:[http][uri]} %{DATA:[http][version]}\" %{INT:[http][responseCode]} (?:%{INT:[http][responseSize]}|-) \"%{DATA:[http][referer]}\" \"%{DATA:[http][userAgent]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} rtm:%{NOTSPACE}/%{NUMBER:[http][responseTimeMs]} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} hct:\"%{DATA:[http][contentType]}\" %{NOTSPACE} hac:\"%{DATA:[http][acceptHeaders]}\" hxa:\"%{DATA:[http][requestHeaders]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} sve:%{NOTSPACE:[http][tlsVersion]} sci:%{NOTSPACE:[http][cipher]} %{GREEDYDATA}$",
                    "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] \"%{WORD:[http][method]} %{URIPATH:[http][uri]}%{URIPARAM:[http][queryString]} %{DATA:[http][version]}\" %{INT:[http][responseCode]} (?:%{INT:[http][responseSize]}|-) \"%{DATA:[http][referer]}\" \"%{DATA:[http][userAgent]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} rtm:%{NOTSPACE}/%{NUMBER:[http][responseTimeMs]} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} hct:\"%{DATA:[http][contentType]}\" %{NOTSPACE} hac:\"%{DATA:[http][acceptHeaders]}\" hxa:\"%{DATA:[http][requestHeaders]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} sve:%{NOTSPACE:[http][tlsVersion]} sci:%{NOTSPACE:[http][cipher]} %{GREEDYDATA}$",
                    "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] \"%{WORD:[http][method]} %{GREEDYDATA:[http][uri]} %{DATA:[http][version]}\" %{INT:[http][responseCode]} (?:%{INT:[http][responseSize]}|-) \"%{DATA:[http][referer]}\" \"%{DATA:[http][userAgent]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} rtm:%{NOTSPACE}/%{NUMBER:[http][responseTimeMs]} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} hct:\"%{DATA:[http][contentType]}\" %{NOTSPACE} hac:\"%{DATA:[http][acceptHeaders]}\" hxa:\"%{DATA:[http][requestHeaders]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} sve:%{NOTSPACE:[http][tlsVersion]} sci:%{NOTSPACE:[http][cipher]} %{GREEDYDATA}$",
                    "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] \"%{WORD:[http][method]} %{URIPATH:[http][uri]} %{DATA:[http][version]}\" %{INT:[http][responseCode]} (?:%{INT:[http][responseSize]}|-) \"%{DATA:[http][referer]}\" \"%{DATA:[http][userAgent]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} rtm:%{NOTSPACE}/%{NUMBER:[http][responseTimeMs]} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} hct:\"%{DATA:[http][contentType]}\" hco:%{GREEDYDATA} hac:\"%{DATA:[http][acceptHeaders]}\" hxa:\"%{DATA:[http][requestHeaders]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} sve:%{NOTSPACE:[http][tlsVersion]} sci:%{NOTSPACE:[http][cipher]} %{GREEDYDATA}$",
                    "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] \"%{WORD:[http][method]} %{URIPATH:[http][uri]}%{URIPARAM:[http][queryString]} %{DATA:[http][version]}\" %{INT:[http][responseCode]} (?:%{INT:[http][responseSize]}|-) \"%{DATA:[http][referer]}\" \"%{DATA:[http][userAgent]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} rtm:%{NOTSPACE}/%{NUMBER:[http][responseTimeMs]} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} hct:\"%{DATA:[http][contentType]}\" hco:%{GREEDYDATA} hac:\"%{DATA:[http][acceptHeaders]}\" hxa:\"%{DATA:[http][requestHeaders]}\" %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} %{NOTSPACE} sve:%{NOTSPACE:[http][tlsVersion]} sci:%{NOTSPACE:[http][cipher]} %{GREEDYDATA}$"
                ]
            }
            remove_field => [ "message" ]
            tag_on_failure => [ "_grokparsefailure", "_gefeparsefailure" ]
        }

Yes.

Regexps can be cheap when they match, but expensive when they fail to match. This is especially true of GREEDYDATA. You can make your regexps cheaper.

Firstly, you are parsing the same prefix in every pattern. You can parse that once so that you avoid the cost of repeatedly parsing it when the rest of the expression fails to match.

grok { match => { "[message]" =>  "^%{IP:[http][remoteIp]} %{HOSTNAME:[http][sourceHostname]} %{NOTSPACE} \[%{HTTPDATE:[http][timeStamp]}\] %{GREEDYDATA:restOfLine}" } }

The match [restOfLine] against your list of anchored patterns.

Secondly, you can use dissect instead of grok, which is much less flexible and therefore cheaper.

dissect { mapping => { "[message]" =>  "%{[http][remoteIp]} %{[http][sourceHostname]} %{} [%{http][timeStamp]}] %{restOfLine}" } }

Thirdly, you are using an array of patterns to match the request. You can pull out the request by itself and then use a separate grok to match these patterns against it. Either

dissect { mapping => { "restOfLine" => '"%{httpRequest}" %{everythingElse}' } }

or

grok { match => { "restOfLine" => '"(?<httpRequest>[^"]*)" %{GREEDYDATA:everythingElse}" } }

You could then use another dissect or grok to take off status, size, referer and user agent (again, replacing DATA with QS or the custom pattern used for httpRequest).

If I am reading it correctly you then have a variable section followed by rtm: and a fixed set of stuff. You could pull out the fixed section using dissect

dissect { mapping => { "everythingElse" => "%{beforeRtm} rtm: %{afterRtm}" } }

or grok

grok { match => { "everythingElse => "%{DATA:beforeRtm} rtm: %GREEDYDATA:afterRtm}" } }

then match your list of variable patterns just against the much short [beforeRtm] field.

1 Like

Many thanks! Just after first split to common part it dropped to 300% :-))

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.