[Solved] Logstash high CPU and unresponsive

Hi,

My platform looks like this:

Varnishlog -> filebeat -> redis -> logstash -> ES

Logstash (2.3.2) is running on 3 different hardware nodes, and so is the ES cluster.

After running for a few minutes (with a cleanly truncated varnishlog), CPU usage for Logstash goes through the roof and then nothing happens... nothing else works until I kill it (kill -9).

I have tweaked and played around with a plethora of settings, but to no avail.
Here is the stack while CPU usage is at 900% (16 cores and 128 GB RAM):

Logstash: I've increased worker threads from 16 to 32.
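For reference, the worker count is passed on the Logstash 2.3 command line; a minimal sketch, where the config file path is an assumption rather than something from this post:

bin/logstash -f /usr/local/etc/logstash/logstash.conf -w 32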

Here is my logstash.conf:

input {

	redis {
		port => 6379
		host => "172.16.0.140"
		password => "xxxx"
		type => filebeat
		key => "filebeat"
		codec => json
		data_type => "list"
		threads => 10
	}
}
filter {
	if [type] == "apache" {
 		mutate {
            		gsub => ["message","\"","'"]
    		}
		grok {
			patterns_dir => "/usr/local/etc/logstash/patterns"
			match => [
				 "message","%{JAF_STAGE_APACHE_ACCESS}",
				 "message","%{JAF_STAGE_APACHE_ERROR}"
			]
			named_captures_only => true
		}
		geoip {
			source => "ip1"
			target => "geoip"
			database => "/usr/local/etc/logstash/GeoLiteCity.dat"
			add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
			add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
		}
		mutate {
			convert => [ "[geoip][coordinates]", "float"]
		}
	}
	if [type] == "jaf_prod_varnish" {
		grok {
			patterns_dir => "/usr/local/etc/logstash/patterns"
			match => [
				 "message","%{JAF_PROD_VARNISH}"
			]
			named_captures_only => true
		}
	}
}

output {
	stdout { codec => rubydebug }
	elasticsearch {
		hosts => [ "172.16.0.140:9200" ]
	}
}

JAF_PROD_VARNISH is the pattern that causes the problem.

Here is the pattern:
JAF_PROD_VARNISH (?:%{IPORHOST:ip1}|%{IPORHOST:ip1}, %{IPORHOST:ip2}|-) - [%{HTTPDATE:timestamp}] %{WORD:method} '%{URIHOST:host}' '%{PATH:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-) %{BASE10NUM:berespms} %{WORD:cache_handling}

Elasticsearch seems to work perfectly and no errors are observed; same for Redis.

Running the Logstash agent in debug mode does not reveal anything.

I have increased LS_HEAP_SIZE from 1g to 10g.
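For reference, on 2.x the heap is set through an environment variable picked up by the stock startup script; a minimal sketch (the value is the one mentioned above, the invocation itself is assumed):

export LS_HEAP_SIZE="10g"
bin/logstash -f /usr/local/etc/logstash/logstash.conf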

Any help, pointers and/or ideas would be greatly appreciated. I am not quite sure where to look now.

thanks

Mike

I did some further debugging (running the Logstash agent with --debug) and discovered something odd:

Please observe the following output (I have changed hostnames and so on, for good measure):

The "message=>"Event now: " output seems to be looping, or something else odd is going on.

Does this look normal?

thanks ,

Mike

Anyone?

Attaching truss to the pid, I see a lot of these (while Logstash is unresponsive):

_umtx_op(0x80065ba88,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd8f8d728) ERR#60 'Operation timed out'
kevent(331,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(143,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(268,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065b3b0,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd8d6768) ERR#60 'Operation timed out'
kevent(161,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(292,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(113,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(316,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(95,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(259,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(319,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(227,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(212,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(295,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
lstat("/var",{ mode=drwxr-xr-x ,inode=4,size=24,blksize=4096 }) = 0 (0x0)
read(685,"7400000 0xb08000000 3052 0 0xfff"...,4096) = 4096 (0x1000)
kevent(179,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(236,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(203,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065b398,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdd9d7848) ERR#60 'Operation timed out'
kevent(242,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(146,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(215,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(176,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(155,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(59,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(131,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(101,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(98,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(224,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(65,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(74,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(137,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(152,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(200,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(107,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(134,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(86,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(173,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(104,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(158,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(334,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(188,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
lstat("/var/db",{ mode=drwxr-xr-x ,inode=27,size=22,blksize=4096 }) = 0 (0x0)
kevent(233,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
kevent(248,0x0,0,{ },128,{ 0.500000000 }) = 0 (0x0)
_umtx_op(0x80065ba88,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffd8f8d728) ERR#60 'Operation timed out'
_umtx_op(0x80065b1b8,UMTX_OP_WAIT_UINT_PRIVATE,0x0,0x18,0x7fffdedebd18) ERR#60 'Operation timed out'
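For completeness, the trace above was captured by attaching to the running process on FreeBSD, roughly like this (the pid placeholder is mine):

truss -p <logstash_pid>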

Another update on this.

jstack reveals what the worker that is using 100% of the CPU (and stalling) is doing:

"[main]>worker7" #64 daemon prio=5 os_prio=15 tid=0x000000090203c000 nid=0x18e61 runnable [0x00007fffd7fab000]
   java.lang.Thread.State: RUNNABLE
        at org.joni.ByteCodeMachine.matchAt(ByteCodeMachine.java:220)
        at org.joni.Matcher.matchCheck(Matcher.java:304)
        at org.joni.Matcher.searchInterruptible(Matcher.java:457)
        at org.jruby.RubyRegexp$SearchMatchTask.run(RubyRegexp.java:273)
        at org.jruby.RubyThread.executeBlockingTask(RubyThread.java:1066)
        at org.jruby.RubyRegexp.matcherSearch(RubyRegexp.java:235)
        at org.jruby.RubyRegexp.search19(RubyRegexp.java:1780)
        at org.jruby.RubyRegexp.matchPos(RubyRegexp.java:1720)
        at org.jruby.RubyRegexp.match19Common(RubyRegexp.java:1701)
        at org.jruby.RubyRegexp.match_m19(RubyRegexp.java:1680)
        at org.jruby.RubyRegexp$INVOKER$i$match_m19.call(RubyRegexp$INVOKER$i$match_m19.gen)
        at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrNBlock.call(JavaMethod.java:350)
        at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:168)
        at rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb:177)
        at rubyjit.Grok$$match_and_capture_5fff680283e4b4ebc4c0eb1732d88dbb187668931442407170.__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/jls-grok-0.11.2/lib/grok-pure.rb)
        at org.jruby.internal.runtime.methods.JittedMethod.call(JittedMethod.java:201)
        at org.jruby.runtime.callsite.CachingCallSite.callBlock(CachingCallSite.java:177)
        at org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:188)
        at rubyjit.LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170.block_0$RUBY$__file__(/usr/local/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-grok-2.0.5/lib/logstash/filters/grok.rb:316)
        at rubyjit$LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170$block_0$RUBY$__file__.call(rubyjit$LogStash::Filters::Grok$$match_against_groks_329a76b8714eb799f1ad4b6ad1484d7896693b861442407170$block_0$RUBY$__file__) [....]
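For completeness, the thread dump above was taken roughly like this (the pid placeholder is mine):

jstack <logstash_pid>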

The only grok pattern in my logstash.conf is:

	if [type] == "prod_varnish" {
		grok {
			match => { "message" => "%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{WORD:method} '%{URIHOST:host}' '%{PATH:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-') '(?:%{URI:referrer}|-)' %{QS:agent} %{BASE10NUM:berespms} %{WORD:cache_handling} (?:%{QS:jafapp}|-)" }
			named_captures_only => true
		}
	}

The Logstash LS_HEAP_SIZE is set to 2g, and I am running 8 worker threads.
The machine is generously sized, so I do not understand why this is not working...

Especially as I am sending very few lines:

2016-06-08T07:34:48.354Z --> 1m rate: 56.565485820879395 - 5m rate: 53.790223826548456 - 15m rate: 53.035713242304084
2016-06-08T07:34:58.359Z --> 1m rate: 53.693102496291594 - 5m rate: 53.31094344739779 - 15m rate: 52.88493649562443
2016-06-08T07:35:08.362Z --> 1m rate: 45.45023001578945 - 5m rate: 51.56320283420884 - 15m rate: 52.30057853513214
2016-06-08T07:35:18.365Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285
2016-06-08T07:35:20.470Z --> 1m rate: 38.47278910044062 - 5m rate: 49.872759973666184 - 15m rate: 51.722677502616285

Can anyone help? Much appreciated!

Mike

Hi,
thanks for the detailed feedback; my hunch is that this does indeed look like a grok issue. To understand it a bit further, can you confirm the following:

  • According to https://github.com/elastic/logstash/issues/5437 you are using FreeBSD 10.3-RELEASE. If that is still true, this is a platform we don't test on; I will try to reproduce the same behaviour on another Unix and see whether I get the same result.

  • Can you share which JVM you are using? I could not find this information here or on GitHub.

  • From what I can see, the initial LS config is different from this last one; am I understanding that correctly? Can you share a more complete LS config to ease reproduction?

  • I am not sure I saw any example log lines; can you share a few? If they contain confidential information, please redact it :slight_smile:. That will make reproduction much easier.

/cheers

purbon

hi @purbon

Thanks for your reply.
Since then I have managed to solve the problem: it was indeed a grok pattern that would hog all the CPU and eventually time out / crash.

I made the following changes to the pattern listed above, and the problem went away:

'(?:%{URI:referrer}|-)' --> '(?:%{NOTSPACE:referrer}|-)'

and

'%{PATH:path}' --> '%{NOTSPACE:path}'
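For reference, applying those two substitutions to the match line quoted earlier gives roughly the following (my reconstruction, not a copy of the running config):

match => { "message" => "%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{WORD:method} '%{URIHOST:host}' '%{NOTSPACE:path}' '(?:%{URIPARAM:param}|)' %{NUMBER:http_status} (?:%{NUMBER:bytes}|-') '(?:%{NOTSPACE:referrer}|-)' %{QS:agent} %{BASE10NUM:berespms} %{WORD:cache_handling} (?:%{QS:jafapp}|-)" }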

Apparently referrers of this type:

'android-app://com.google.android.googlequicksearchbox'

would break the pattern matching.

Malformed URLs with lots of "%20" in them would also break the PATH grok pattern.

The NOTSPACE pattern is much more forgiving.

Thanks,

Hi,
thanks for closing the loop on this issue; it is very important so that others can also learn from your experience. If my regular-expression foo does not mislead me, and from what I understood of your explanation, the URI pattern was basically what was causing all the failures. I can see why: the part before '://' in your referrers includes a '-' (as in 'android-app'), while the URIPROTO pattern is strict and allows only letters.

NOTSPACE will basically accept any run of non-whitespace characters, so it will certainly match a much wider set of inputs.
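For reference, the relevant default definitions look roughly like this (quoted from memory of the grok-patterns file of that era; check your local patterns file for the exact versions):

NOTSPACE \S+
PATH (?:%{UNIXPATH}|%{WINPATH})
URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?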

With grok, some degree of ReDoS (runaway backtracking on unexpected input) is always something to keep in mind.
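As an illustration (not taken from this thread), anchoring an expression makes grok fail fast on lines that do not match instead of retrying at every offset, which limits this kind of runaway matching; the field names mirror the pattern above and the GREEDYDATA tail is only for the sketch:

grok {
	# ^ and $ force a single match attempt per line instead of a scan.
	match => { "message" => "^%{IP:ip1}, %{IP:ip2} - \[%{HTTPDATE:timestamp}\] %{GREEDYDATA:rest}$" }
}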

Yep - that's it!