[Solved] Logstash high CPU and unresponsive

Hi,

My platform looks like this:

Varnishlog -> filebeat -> redis -> logstash -> ES

Logstash (2.3.2) is running on 3 different hardware nodes, and so is the ES cluster.

After running for a few minutes (with a cleanly truncated varnishlog), CPU usage for Logstash goes through the roof and then nothing happens... nothing else works until I kill it (kill -9).

I have tweaked and played around with a plethora of settings, but to no avail.
Here is the stack while CPU usage is at 900% (16 cores and 128 GB RAM):

Logstash: I've increased worker threads from 16 to 32.
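For reference, the worker count is passed on the Logstash 2.3 command line; a minimal sketch, where the config file path is an assumption rather than something from this post:

bin/logstash -f /usr/local/etc/logstash/logstash.conf -w 32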

Here is my logstash.conf:

input {

	redis {
		port => 6379
		host => "172.16.0.140"
		password => "xxxx"
		type => filebeat
		key => "filebeat"
		codec => json
		data_type => "list"
		threads => 10
	}
}
filter {
	if [type] == "apache" {
 		mutate {
            		gsub => ["message","\"","'"]
    		}
		grok {
			patterns_dir => "/usr/local/etc/logstash/patterns"
			match => [
				 "message","%{JAF_STAGE_APACHE_ACCESS}",
				 "message","%{JAF_STAGE_APACHE_ERROR}"
			]
			named_captures_only => true
		}
		geoip {
			source => "ip1"
			target => "geoip"
			database => "/usr/local/etc/logstash/GeoLiteCity.dat"
			add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
			add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
		}
		mutate {
			convert => [ "[geoip][coordinates]", "float"]
		}
	}
	if [type] == "jaf_prod_varnish" {
		grok {
			patterns_dir => "/usr/local/etc/logstash/patterns"
			match => [
				 "message","%{JAF_PROD_VARNISH}"
			]
			named_captures_only => true
		}
	}
}

output {
	stdout { codec => rubydebug }
	elasticsearch {
		hosts => [ "172.16.0.140:9200" ]
	}
}

JAF_PROD_VARNISH is the pattern that causes the problem.

Here is the pattern:
JAF_PROD_VARNISH (?:%{IPORHOST:ip1}|%{IPORHOST:ip1}, %{IPORHOST:ip2}|-) - [%{HTTPDATE:timestamp}] %{WORD:method} '%{URIHOST:host}' '%{PATH:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-) %{BASE10NUM:berespms} %{WORD:cache_handling}

Elasticsearch seems to work perfectly and no errors are observed; same for Redis.

Running the Logstash agent in debug mode does not reveal anything.

I have increased LS_HEAP_SIZE from 1g to 10g.
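For reference, on 2.x the heap is set through an environment variable picked up by the stock startup script; a minimal sketch (the value is the one mentioned above, the invocation itself is assumed):

export LS_HEAP_SIZE="10g"
bin/logstash -f /usr/local/etc/logstash/logstash.conf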

Any help, pointers and/or ideas would be greatly appreciated. I am not quite sure where to look now.

thanks

Mike

I did some further debugging (running the Logstash agent with --debug) and discovered something odd:

Please observe the following output (I have changed hostnames and so on, for good measure):

The "message=>"Event now: " output seems to be looping, or something else odd is going on.

Does this look normal?

thanks ,

Mike

Anyone?

Attaching truss to the pid, I see a lot of these (while Logstash is unresponsive):

_umtx_op(0x80065ba88,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd8f8d728) ERR#60 'Operation timed out'
kevent(331,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(143,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(268,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065b3b0,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd8d6768) ERR#60 'Operation timed out'
kevent(161,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(292,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(113,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(316,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(95,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(259,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(319,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(227,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(212,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(295,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
lstat("/var",{ mode=drwxr-xr-x ,inode=4,size=24,blksize=4096 }) = 0 (0x0)
read(685,"7400000 0xb08000000 3052 0 0xfff"...,4096) = 4096 (0x1000)
kevent(179,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(236,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(203,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065b398,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd9d7848) ERR#60 'Operation timed out'
kevent(242,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(146,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(215,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(176,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(155,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(59,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(131,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(101,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(98,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(224,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(65,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(74,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(137,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(152,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(200,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(107,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(134,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(86,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(173,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(104,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(158,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(334,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(188,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
lstat("/var/db",{ mode=drwxr-xr-x ,inode=27,size=22,blksize=4096 }) = 0 (0x0)
kevent(233,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(248,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065ba88,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd8f8d728) ERR#60 'Operation timed out'
_umtx_op(0x80065b1b8,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdedebd18) ERR#60 'Operation timed out'
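For completeness, the trace above was captured by attaching to the running process on FreeBSD, roughly like this (the pid placeholder is mine):

truss -p <logstash_pid>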

Another update on this.

jstack reveals what the worker that is using 100% of the CPU (and stalling) is doing:

"[main]>worker7" #64 daemon prio=5 os_prio=15 tid=0x000000090203c000 nid=0x18e61 runnable [0x00007fffd7fab000]
   java.lang.Thread.State: RUNNABLE
        at org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:220)
        at org.joni.Matcher.matchCheck(Matcher.java:304)
        at org.joni.Matcher.searchInterruptible(Matcher.java:457)
        at org.jruby.RubyRegexp$SearchMatchTask.run(RubyRegexp.java:273)
        at org.jruby.RubyThread.executeBlockingTask(RubyThread.java:1066)
        at org.jruby.RubyRegexp.matcherSearch(RubyRegexp.java:235)
        at org.jruby.RubyRegexp.search19(RubyRegexp.java:1780)
        at org.jruby.RubyRegexp.matchPos(RubyRegexp.java:1720)
        at org.jruby.RubyRegexp.match19Common(RubyRegexp.java:1701)
        at org.jruby.RubyRegexp.match_m19(RubyRegexp.java:1680)
        at org.jruby.RubyRegexp$INVOKER$i$match_m19.call(RubyRegexp$INVOKER$i$match_m19.gen)
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrNBlock.call(JavaMethod.java:350)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
        at rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb:177)
        at rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb)
        at org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:201)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:177)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:188)
        at rubyjit.LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170.block_0$RUBY$__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-2.0.5/lib/logstash/filters/grok.rb:316)
        at rubyjit$LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170$block_0$RUBY$__file__.call(rubyjit$LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170$block_0$RUBY$__file__) [....]
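For completeness, the thread dump above was taken roughly like this (the pid placeholder is mine):

jstack <logstash_pid>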

The only grok pattern in my logstash.conf is:

	if [type] == "prod_varnish" {
		grok {
			match => { "message" => "%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{WORD:method} '%{URIHOST:host}' '%{PATH:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-') '(?:%{URI:referrer}|-)' %{QS:agent} %{BASE10NUM:berespms} %{WORD:cache_handling} (?:%{QS:jafapp}|-)" }
			named_captures_only => true
		}
	}

The Logstash LS_HEAP_SIZE is set to 2g, and I am running 8 worker threads.
The machine is generously sized, so I do not understand why this is not working...

Especially as I am sending very few lines:

2016-06-08T07:34:48.354Z --> 1m rate: 56.565485820879395 - 5m rate: 53.790223826548456 - 15m rate: 53.035713242304084
2016-06-08T07:34:58.359Z --> 1m rate: 53.693102496291594 - 5m rate: 53.31094344739779 - 15m rate: 52.88493649562443
2016-06-08T07:35:08.362Z --> 1m rate: 45.45023001578945 - 5m rate: 51.56320283420884 - 15m rate: 52.30057853513214
2016-06-08T07:35:18.365Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285

Can anyone help? Much appreciated!

Mike

Hi,
thanks for the detailed feedback; my hunch is that this does indeed look like a grok issue. To understand it a bit further, can you confirm the following:

  • According to https://github.com/elastic/logstash/issues/5437 you are using FreeBSD 10.3-RELEASE. If that is still true, this is a platform we don't test on; I will try to reproduce the same behaviour on another Unix and see whether I get the same result.

  • Can you share which JVM you are using? I could not find this information here or on GitHub.

  • From what I can see, the initial LS config is different from this last one; am I understanding that correctly? Can you share a more complete LS config to ease reproduction?

  • I am not sure I saw any example log lines; can you share a few? If they contain confidential information, please redact it :slight_smile:. That will make reproduction much easier.

/cheers

purbon

hi @purbon

Thanks for your reply.
Since then I have managed to solve the problem: it was indeed a grok pattern that would hog all the CPU and eventually time out / crash.

I made the following changes to the pattern listed above, and the problem went away:

'(?:%{URI:referrer}|-)' --> '(?:%{NOTSPACE:referrer}|-)'

and

'%{PATH:path}' --> '%{NOTSPACE:path}'
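For reference, applying those two substitutions to the match line quoted earlier gives roughly the following (my reconstruction, not a copy of the running config):

match => { "message" => "%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{WORD:method} '%{URIHOST:host}' '%{NOTSPACE:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-') '(?:%{NOTSPACE:referrer}|-)' %{QS:agent} %{BASE10NUM:berespms} %{WORD:cache_handling} (?:%{QS:jafapp}|-)" }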

Apparently referrers of this type:

'android-app://com.google.android.googlequicksearchbox'

would break the pattern matching.

Malformed URLs with lots of "%20" in them would also break the PATH grok pattern.

The NOTSPACE pattern is much more forgiving.

Thanks,

Hi,
thanks for closing the loop on this issue; it is very important so that others can also learn from your experience. If my regular-expression foo does not mislead me, and from what I understood of your explanation, the URI pattern was basically what was causing all the failures. I can see why: the part before '://' in your referrers includes a '-' (as in 'android-app'), while the URIPROTO pattern is strict and allows only letters.

NOTSPACE will basically accept any run of non-whitespace characters, so it will certainly match a much wider set of inputs.
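For reference, the relevant default definitions look roughly like this (quoted from memory of the grok-patterns file of that era; check your local patterns file for the exact versions):

NOTSPACE \S+
PATH (?:%{UNIXPATH}|%{WINPATH})
URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?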

With grok, some degree of ReDoS (runaway backtracking on unexpected input) is always something to keep in mind.
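As an illustration (not taken from this thread), anchoring an expression makes grok fail fast on lines that do not match instead of retrying at every offset, which limits this kind of runaway matching; the field names mirror the pattern above and the GREEDYDATA tail is only for the sketch:

grok {
	# ^ and $ force a single match attempt per line instead of a scan.
	match => { "message" => "^%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{GREEDYDATA:rest}$" }
}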

Yep - that's it!