Determining reason for stuck grok filter

Hi all,

I've tried using Logstash versions 6.8.1, 7.0 and 7.2 and, regardless of version, Logstash seems to stop its processing and begin consuming all the available CPU. It appears to be stuck in the grok filter. I suspect this because, when I shut down Logstash, I get messages regarding org.logstash.execution.ShutdownWatcherExt and stalling_threads_info that contain the actual grok patterns I'm using.

In addition, the tail end of those stalling_threads_info messages looks like this:

"name"=>"[main]>worker0", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in \match'"}, {"thread_id"=>28, "name"=>"[main]>worker1", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in `match'"}]}}`

Is there some way to determine which input/log event is causing the regular expression, or pattern, to go crazy like this? Is there a recommended approach to resolving issues such as this?

What does your grok pattern look like? I suspect that there is excessive backtracking occurring.

Here is one. YESNO is a custom pattern defined as "yes|no|YES|NO" (a sketch of how it is wired in follows the pattern):

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}" targetrate=%{NONNEGINT:targetrate} minrate=%{INT:minrate} rate_policy=%{WORD:rate_policy} cipher=%{DATA:cipher} resume=%{INT:resume} create=%{INT:create} ovr=%{INT:ovr} times=%{INT:times} precalc=%{WORD:precalc} mf=%{INT:mf} mf_path=%{DATA:mf_path} mf_suffix=%{DATA:mf_suffix} partial_file_suffix=%{DATA:partial_file_suffix} files_encrypt=%{YESNO:files_encrypt} files_decrypt=%{YESNO:files_decrypt} file_csum=%{WORD:file_csum} dgram_sz=%{NUMBER:dgram_sz} prepostcmd=%{DATA:prepostcmd} tcp_mode=%{YESNO:tcp_mode} rtt_auto=%{YESNO:rtt_auto} cookie="%{DATA:cookie}" vl_proto_ver=%{NUMBER:vl_proto_ver} peer_vl_proto_ver=%{NUMBER:peer_vl_proto_ver} vl_local=%{NUMBER:vl_local} vlink_remote=%{NUMBER:vlink_remote} vl_sess_id=%{DATA:vl_sess_id} srcbase=%{DATA:srcbase} rd_sz=%{NUMBER:rd_size} wr_sz=%{NUMBER:wr_sz} cluster_num_nodes=%{NONNEGINT:cluster_num_nodes} cluster_node_id=%{INT:cluster_node_id} cluster_multi_session_threshold=%{INT:cluster_multi_session_threshold} range=%{DATA:range} keepalive=%{YESNO:keepalive} test_login=%{YESNO:test_login} proxy_ip=%{DATA:proxy_ip} net_rc_alg=%{WORD:net_rc_alg} exclude_older/newer_than=%{NUMBER:exclude_older}/%{NUMBER:newer_than}
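It is wired into the grok filter roughly like this (a simplified sketch; whether the definition lives inline or in a patterns_dir file shouldn't matter, and FASP_DATE is defined the same way):

filter {
  grok {
    # Inline definition of the custom pattern; an entry "YESNO yes|no|YES|NO"
    # in a patterns_dir file is equivalent.
    pattern_definitions => { "YESNO" => "yes|no|YES|NO" }
    match => { "message" => "%{FASP_DATE:timestamp} %{WORD:hostname} ..." } # full pattern as above
  }
}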

I do have another one that makes use of multiple "GREEDYDATA" fields...

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Stop uuid=%{UUID:uuid} op=%{WORD:operation} status=%{FASP_STATUS:status} errcode=%{NUMBER:errcode} errstr="%{GREEDYDATA:errstr}" source=%{DATA:source} dest=%{DATA:dest} source_prefix=%{DATA:source_prefix} local=%{IP:local_ip}:%{NUMBER:port} peer=%{IP:client_ip}:%{NUMBER:client_port} tcp_port=%{NUMBER:tcp_port} os="%{GREEDYDATA:os}" ver=%{DATA:ver} lic=%{DATA:lic} peeros="%{GREEDYDATA:client_os}" peerver=%{DATA:client_ver} peerlic=%{DATA:client_lic} proto_sess=%{NUMBER:proto_sess} proto_udp=%{NUMBER:proto_udp} proto_bwmeas=%{NUMBER:proto_bwmeas} proto_data=%{WORD:proto_data}

That first pattern has 9 DATA patterns in it. It is going to be very expensive for grok to determine that a line does not match. Wherever possible, replace DATA with something else. If a field cannot contain a space, then replace %{DATA:cipher} with

(?<cipher>[^ ]+)

and so on. Ideally you should only have one DATA or GREEDYDATA and it should be the last item in the pattern.
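Applied to a slice of the first pattern, and assuming those values can never contain a space, it would look something like this:

cipher=(?<cipher>[^ ]+) resume=%{INT:resume} create=%{INT:create} ovr=%{INT:ovr}

Each custom capture stops at the next space, rather than letting DATA try ever longer substrings when the overall match is going to fail.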

Also, most importantly, anchor your patterns. Read this Elastic blog entry. If you are matching the whole of a line, there is a massive performance difference between the cost of

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}"

and

^%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}"

when they do not match.
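In the configuration that just means putting the caret at the front of the match string, for example (a sketch; the rest of the pattern is unchanged):

filter {
  grok {
    match => { "message" => "^%{FASP_DATE:timestamp} %{WORD:hostname} ..." } # rest of the pattern as above
  }
}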

Also, doing all of this in grok looks like the wrong approach to me. I would make a single pass with grok using this

^%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session %{WORD:operation} %{GREEDYDATA:restOfLine}

and then use a kv filter to parse restOfLine.
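Putting the two together, something like this is what I have in mind (a sketch; I have escaped the [ ] around the pid since they are regex metacharacters, and the kv defaults of splitting fields on spaces and values on = should be close to what this log format needs):

filter {
  grok {
    # One cheap, anchored pattern that only pulls out the common prefix.
    match => { "message" => "^%{FASP_DATE:timestamp} %{WORD:hostname} ascp\[%{POSINT:pid}\]: LOG FASP Session %{WORD:operation} %{GREEDYDATA:restOfLine}" }
  }
  kv {
    # Split the remaining key=value pairs into their own fields.
    source => "restOfLine"
    remove_field => [ "restOfLine" ]
  }
}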


Thanks very much. You've given me a whole bunch of leads to track down. Some outstanding recommendations and tips here. Thank you!

So far I've eliminated the use of DATA and GREEDYDATA to see if there is an improvement in CPU consumption, but the situation is unchanged. I've also added a timeout_millis setting (set to 300, as sketched below) in order to skip events that cause the regular expression engine to take too long. Unfortunately, this also doesn't seem to have any effect, and the CPU still gets completely consumed. I would have expected the number of events in my input (redis) to go down over time, since I've asked Logstash to skip slow events, but it just stalls after several hundred events are removed. Any suggestions about what to try next?
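For reference, the timeout is set inside each grok block roughly like this (simplified; the patterns themselves are the ones shown earlier):

filter {
  grok {
    match => { "message" => [ "...pattern 1...", "...pattern 2..." ] } # placeholders for the patterns above
    timeout_millis => 300 # give up on an event if matching takes longer than 300 ms
  }
}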

Are you absolutely sure you are running with the modified configuration?

Just to be sure, I introduced an obvious typo into the configuration and restarted. Sure enough, a LogStash::ConfigurationError was thrown in the logs, so the modified configuration is definitely the one being loaded. Since I have two inputs, I decided to turn one of them off by commenting it out. That helped quite a bit and the CPU usage came down to single digits. Therefore, the redis input must contain the events that are causing the high CPU condition, which I confirmed by reactivating it and observing 100% CPU again... I suppose the next thing to try is the kv filter technique you suggested earlier, but I'm a little disheartened that the timeout_millis setting didn't seem to do anything...

Looks like removing the multitude of grok patterns and using just one, followed by the kv filter, was the key to resolving this problem. Many thanks!
