I've tried Logstash versions 6.8.1, 7.0, and 7.2 and, regardless of version, Logstash seems to stop its processing and begin consuming all the available CPU. It appears to be stuck in the grok filter. I suspect this is the case because, when I shut down Logstash, I get messages regarding org.logstash.execution.ShutdownWatcherExt and stalling_threads_info that contain the actual grok patterns I'm using.
In addition, the tail end of those stalling_threads_info messages looks like this:
Is there some way to determine which input/log event is causing the regular expression, or pattern, to go crazy like this? Is there a recommended approach to resolving issues like this?
That first pattern has nine DATA patterns in it, which makes it very expensive for grok to determine that a line does not match. Wherever possible, replace DATA with something more specific. If a field cannot contain a space, then replace %{DATA:cipher} with
(?<cipher>[^ ]+)
and so on. Ideally you should have only one DATA or GREEDYDATA, and it should be the last item in the pattern.
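As a minimal sketch of the difference (the field names here are placeholders, not the pattern from the original configuration):

filter {
  grok {
    # Expensive: a chain of DATA captures forces heavy backtracking on lines that do not match
    # match => { "message" => "%{DATA:client} %{DATA:protocol} %{DATA:cipher} %{GREEDYDATA:rest}" }

    # Cheaper: each capture stops at the first space, and only the final capture is greedy
    match => { "message" => "(?<client>[^ ]+) (?<protocol>[^ ]+) (?<cipher>[^ ]+) %{GREEDYDATA:rest}" }
  }
}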
Also, most importantly, anchor your patterns. Read this Elastic blog entry. If you are matching the whole of a line, there is a massive performance difference between the cost of an anchored pattern, which fails after a single attempt, and an unanchored one, which the regex engine retries from every position in the line before it can report a non-match.
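A rough sketch of an anchored match (the timestamp/level/message fields are illustrative, not the poster's real pattern):

filter {
  grok {
    # ^ and $ tie the pattern to the whole line, so a non-matching line fails after one attempt
    match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}$" }
  }
}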
So far I've eliminated the use of DATA and GREEDYDATA to see if there is an improvement in CPU consumption, but the situation is unchanged. I've also added a timeout_millis setting (set to 300) in order to skip events that cause the regular expression engine to take too long. Unfortunately, this ALSO doesn't seem to have any effect and the CPU still gets completely consumed. I would have expected the events in my input (redis) to go down over time, since I've asked Logstash to skip slow events, but it just stalls after several hundred events are removed. Any suggestions about what to try next?
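For context, the timeout looks roughly like this in my filter block (the pattern shown is a placeholder, not my real one); timeout_millis and tag_on_timeout are documented options of the grok filter:

filter {
  grok {
    match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:msg}$" }
    timeout_millis => 300               # give up on a single match attempt after 300 ms
    tag_on_timeout => "_groktimeout"    # tag slow events so they can be found later (this is the default tag)
  }
}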
Just to be sure my configuration changes were being picked up, I introduced an obvious typo into the configuration and restarted. Sure enough, a LogStash::ConfigurationError was thrown in the logs. Since I have two inputs, I then decided to turn one of them off by commenting it out. That helped quite a bit and CPU usage came down to single digits. Therefore, the other input (redis) must contain the events that are causing the high CPU condition, which I confirmed by reactivating it and observing 100% CPU again... I suppose the next thing to try is the kv filter technique you suggested earlier, but I'm a little disheartened that the timeout_millis setting didn't seem to do anything...
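Roughly what the input block looked like while isolating the problem (the second input is shown as beats purely for illustration; only the redis one turned out to matter):

input {
  # redis {                    # temporarily commented out to isolate the problem events
  #   host      => "localhost"
  #   data_type => "list"
  #   key       => "logstash"
  # }
  beats {
    port => 5044               # the other input; its type and port here are assumptions
  }
}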
Looks like removing the multitude of grok patterns and using just one, followed by the kv filter, was the key to resolving this problem. Many thanks!
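For anyone finding this later, the shape of the fix was roughly the following (field names and separators are illustrative; adjust them to your own log format):

filter {
  # One cheap, anchored grok that only splits off the key=value payload
  grok {
    match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{GREEDYDATA:kvpairs}$" }
  }
  # kv then extracts the individual fields, replacing the long chain of grok patterns
  kv {
    source      => "kvpairs"
    field_split => " "
    value_split => "="
  }
}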