Determining reason for stuck grok filter

Hi all,

I've tried using Logstash versions 6.8.1, 7.0 and 7.2 and, regardless of version, Logstash seems to stop its processing and begin consuming all the available CPU. It appears to be stuck in the grok filter. I suspect this because, when I shut down Logstash, I get messages regarding org.logstash.execution.ShutdownWatcherExt and stalling_threads_info that contain the actual grok patterns I'm using.

In addition, the tail end of those stalling_threads_info messages looks like this:

"name"=>"[main]>worker0", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in \match'"}, {"thread_id"=>28, "name"=>"[main]>worker1", "current_call"=>"[...]/vendor/bundle/jruby/2.5.0/gems/jls-grok-0.11.5/lib/grok-pure.rb:182:in `match'"}]}}`

Is there some way to determine which input/log event is causing the regular expression, or pattern, to go crazy like this? Is there a recommended approach to resolving issues such as this?

What does your grok pattern look like? I suspect that there is excessive backtracking occurring.

Here is one. YESNO is a custom pattern defined as "yes|no|YES|NO" (a sketch of how it is wired in follows the pattern):

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}" targetrate=%{NONNEGINT:targetrate} minrate=%{INT:minrate} rate_policy=%{WORD:rate_policy} cipher=%{DATA:cipher} resume=%{INT:resume} create=%{INT:create} ovr=%{INT:ovr} times=%{INT:times} precalc=%{WORD:precalc} mf=%{INT:mf} mf_path=%{DATA:mf_path} mf_suffix=%{DATA:mf_suffix} partial_file_suffix=%{DATA:partial_file_suffix} files_encrypt=%{YESNO:files_encrypt} files_decrypt=%{YESNO:files_decrypt} file_csum=%{WORD:file_csum} dgram_sz=%{NUMBER:dgram_sz} prepostcmd=%{DATA:prepostcmd} tcp_mode=%{YESNO:tcp_mode} rtt_auto=%{YESNO:rtt_auto} cookie="%{DATA:cookie}" vl_proto_ver=%{NUMBER:vl_proto_ver} peer_vl_proto_ver=%{NUMBER:peer_vl_proto_ver} vl_local=%{NUMBER:vl_local} vlink_remote=%{NUMBER:vlink_remote} vl_sess_id=%{DATA:vl_sess_id} srcbase=%{DATA:srcbase} rd_sz=%{NUMBER:rd_size} wr_sz=%{NUMBER:wr_sz} cluster_num_nodes=%{NONNEGINT:cluster_num_nodes} cluster_node_id=%{INT:cluster_node_id} cluster_multi_session_threshold=%{INT:cluster_multi_session_threshold} range=%{DATA:range} keepalive=%{YESNO:keepalive} test_login=%{YESNO:test_login} proxy_ip=%{DATA:proxy_ip} net_rc_alg=%{WORD:net_rc_alg} exclude_older/newer_than=%{NUMBER:exclude_older}/%{NUMBER:newer_than}
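It is wired into the grok filter roughly like this (a simplified sketch; whether the definition lives inline or in a patterns_dir file shouldn't matter, and FASP_DATE is defined the same way):

filter {
  grok {
    # Inline definition of the custom pattern; an entry "YESNO yes|no|YES|NO"
    # in a patterns_dir file is equivalent.
    pattern_definitions => { "YESNO" => "yes|no|YES|NO" }
    match => { "message" => "%{FASP_DATE:timestamp} %{WORD:hostname} ..." } # full pattern as above
  }
}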

I do have another one that makes use of multiple "GREEDYDATA" fields...

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Stop uuid=%{UUID:uuid} op=%{WORD:operation} status=%{FASP_STATUS:status} errcode=%{NUMBER:errcode} errstr="%{GREEDYDATA:errstr}" source=%{DATA:source} dest=%{DATA:dest} source_prefix=%{DATA:source_prefix} local=%{IP:local_ip}:%{NUMBER:port} peer=%{IP:client_ip}:%{NUMBER:client_port} tcp_port=%{NUMBER:tcp_port} os="%{GREEDYDATA:os}" ver=%{DATA:ver} lic=%{DATA:lic} peeros="%{GREEDYDATA:client_os}" peerver=%{DATA:client_ver} peerlic=%{DATA:client_lic} proto_sess=%{NUMBER:proto_sess} proto_udp=%{NUMBER:proto_udp} proto_bwmeas=%{NUMBER:proto_bwmeas} proto_data=%{WORD:proto_data}

That first pattern has 9 DATA patterns in it. It is going to be very expensive for grok to determine that a line does not match. Wherever possible, replace DATA with something else. If a field cannot contain a space, then replace %{DATA:cipher} with

(?<cipher>[^ ]+)

and so on. Ideally you should only have one DATA or GREEDYDATA and it should be the last item in the pattern.
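Applied to a slice of the first pattern, and assuming those values can never contain a space, it would look something like this:

cipher=(?<cipher>[^ ]+) resume=%{INT:resume} create=%{INT:create} ovr=%{INT:ovr}

Each custom capture stops at the next space, rather than letting DATA try ever longer substrings when the overall match is going to fail.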

Also, most importantly, anchor your patterns. Read this Elastic blog entry. If you are matching the whole of a line, there is a massive performance difference between the cost of

%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}"

and

^%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session Params uuid=%{UUID:uuid} userid=%{NUMBER:userid} user="%{WORD:remote_user}"

when they do not match.
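In the configuration that just means putting the caret at the front of the match string, for example (a sketch; the rest of the pattern is unchanged):

filter {
  grok {
    match => { "message" => "^%{FASP_DATE:timestamp} %{WORD:hostname} ..." } # rest of the pattern as above
  }
}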

Also, doing all of this in grok looks like the wrong approach to me. I would make a single pass with grok using this

^%{FASP_DATE:timestamp} %{WORD:hostname} ascp[%{POSINT:pid}]: LOG FASP Session %{WORD:operation} %{GREEDYDATA:restOfLine}

and then use a kv filter to parse restOfLine.
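Putting the two together, something like this is what I have in mind (a sketch; I have escaped the [ ] around the pid since they are regex metacharacters, and the kv defaults of splitting fields on spaces and values on = should be close to what this log format needs):

filter {
  grok {
    # One cheap, anchored pattern that only pulls out the common prefix.
    match => { "message" => "^%{FASP_DATE:timestamp} %{WORD:hostname} ascp\[%{POSINT:pid}\]: LOG FASP Session %{WORD:operation} %{GREEDYDATA:restOfLine}" }
  }
  kv {
    # Split the remaining key=value pairs into their own fields.
    source => "restOfLine"
    remove_field => [ "restOfLine" ]
  }
}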


Thanks very much. You've given me a whole bunch of leads to track down. Some outstanding recommendations and tips here. Thank you!

So far I've eliminated the use of DATA and GREEDYDATA to see if there is an improvement in CPU consumption, but the situation is unchanged. I've also added a timeout_millis setting (set to 300, as sketched below) in order to skip events that cause the regular expression engine to take too long. Unfortunately, this also doesn't seem to have any effect, and the CPU still gets completely consumed. I would have expected the number of events in my input (redis) to go down over time, since I've asked Logstash to skip slow events, but it just stalls after several hundred events are removed. Any suggestions about what to try next?
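For reference, the timeout is set inside each grok block roughly like this (simplified; the patterns themselves are the ones shown earlier):

filter {
  grok {
    match => { "message" => [ "...pattern 1...", "...pattern 2..." ] } # placeholders for the patterns above
    timeout_millis => 300 # give up on an event if matching takes longer than 300 ms
  }
}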

Are you absolutely sure you are running with the modified configuration?

Just to be sure, I introduced an obvious typo into the configuration and restarted. Sure enough, a LogStash::ConfigurationError was thrown in the logs, so the modified configuration is definitely the one being loaded. Since I have two inputs, I decided to turn one of them off by commenting it out. That helped quite a bit and the CPU usage came down to single digits. Therefore, the redis input must contain the events that are causing the high CPU condition, which I confirmed by reactivating it and observing 100% CPU again... I suppose the next thing to try is the kv filter technique you suggested earlier, but I'm a little disheartened that the timeout_millis setting didn't seem to do anything...

Looks like removing the multitude of grok patterns and using just one, followed by the kv filter, was the key to resolving this problem. Many thanks!
