"Counters can not be incremented by negative values" error when using metric filters freezes workers and "corrupts" the persistent queue

Hi,

I am running Logstash-6.0.0 with metrics filters to track performance.

I have noticed that the performance of some of my nodes decreases over time. On checking the logs, I found messages like the following:

[2017-11-22T11:05:41,195][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {:pipeline_id=>"main", "exception"=>"Counters can not be incremented by negative values", "backtrace"=>["org.logstash.instrument.metrics.counter.LongCounter.increment(org/logstash/instrument/metrics/counter/LongCounter.java:61)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)", "org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)", "org.jruby.javasupport.JavaMethod.tryProxyInvocation(org/jruby/javasupport/JavaMethod.java:633)", "org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:311)", "usr.share.logstash.logstash_minus_core.lib.logstash.filter_delegator.invokeOther17:increment(usr/share/logstash/logstash_minus_core/lib/logstash//usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:49)", "usr.share.logstash.logstash_minus_core.lib.logstash.filter_delegator.multi_filter(/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:49)", "RUBY.block in initialize((eval):8235)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1734)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(org/jruby/RubyArray$INVOKER$i$0$0$each.gen)", "RUBY.block in initialize((eval):8229)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:289)", "org.jruby.RubyProc.call19(org/jruby/RubyProc.java:273)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(org/jruby/RubyProc$INVOKER$i$0$0$call19.gen)", "RUBY.block in initialize((eval):8363)", "org.jruby.RubyArray.each(org/jruby/RubyArray.java:1734)", "org.jruby.RubyArray$INVOKER$i$0$0$each.call(org/jruby/RubyArray$INVOKER$i$0$0$each.gen)", "RUBY.block in initialize((eval):8359)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:289)", "org.jruby.RubyProc.call19(org/jruby/RubyProc.java:273)", "org.jruby.RubyProc$INVOKER$i$0$0$call19.call(org/jruby/RubyProc$INVOKER$i$0$0$call19.gen)", "RUBY.block in filter_func((eval):1764)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:289)", "usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.invokeOther3:filter_func(usr/share/logstash/logstash_minus_core/lib/logstash//usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:501)", "usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.filter_batch(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:501)", "RUBY.worker_loop(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:477)", "RUBY.block in start_workers(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:439)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:289)", "org.jruby.RubyProc.call(org/jruby/RubyProc.java:246)", "java.lang.Thread.run(java/lang/Thread.java:748)"], :thread=>"#<Thread:0xdaa67c6@/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:290 sleep>"}

I found a closed issue on GitHub that looks similar, but to my understanding it should have been fixed in the current release. My nodes run NTP, and it is possible that the time occasionally jumps back slightly.

I never encountered this error in 5.6.0.

The wider ramification of this problem (at least for me) is that it corrupts the persistent queues, as the "frozen" workers' events are apparently not flushed from the queue while the rest are.

If Logstash is then restarted, it fails to start. Please see the linked issue for details. The only option is then to delete the existing queue files and restart (until the next metrics failure).
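For reference, the recovery I end up doing amounts to something like the following sketch (the path is an assumption based on a default package install; the queue actually lives under whatever path.queue / path.data resolves to on your system):

# Rough sketch only: clear the persistent queue for the "main" pipeline
# while Logstash is stopped, then start the service again.
# /var/lib/logstash is the assumed default path.data for a package install;
# the queue files sit under <path.data>/queue/<pipeline_id>.
require 'fileutils'
queue_dir = '/var/lib/logstash/queue/main'
FileUtils.rm_rf(Dir.glob(File.join(queue_dir, '*')))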

UPDATE:

I encountered the same error again today, but this time it was logged to syslog. The relevant syslog entries are:

Nov 27 02:16:59 logstash3 ntpd[12866]: 0.0.0.0 061c 0c clock_step -0.180540 s
Nov 27 02:16:59 logstash3 ntpd[12866]: 0.0.0.0 0615 05 clock_sync
Nov 27 02:16:59 logstash3 systemd: Time has been changed
Nov 27 02:16:59 logstash3 logstash: Exception in thread "[main]>worker2" java.lang.UnsupportedOperationException: Counters can not be incremented by negative values
Nov 27 02:16:59 logstash3 logstash: at org.logstash.instrument.metrics.counter.LongCounter.increment(org/logstash/instrument/metrics/counter/LongCounter.java:61)
Nov 27 02:16:59 logstash3 logstash: at java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:498)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(org/jruby/javasupport/JavaMethod.java:453)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.javasupport.JavaMethod.tryProxyInvocation(org/jruby/javasupport/JavaMethod.java:633)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.javasupport.JavaMethod.invokeDirect(org/jruby/javasupport/JavaMethod.java:311)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.output_delegator.invokeOther15:increment(usr/share/logstash/logstash_minus_core/lib/logstash//usr/share/logstash/logstash-core/lib/logstash/output_delegator.rb:50)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.output_delegator.multi_receive(/usr/share/logstash/logstash-core/lib/logstash/output_delegator.rb:50)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.invokeOther8:multi_receive(usr/share/logstash/logstash_minus_core/lib/logstash//usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:538)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.block in output_batch(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:538)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash$12.visit(org/jruby/RubyHash.java:1362)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash$12.visit(org/jruby/RubyHash.java:1359)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash.visitLimited(org/jruby/RubyHash.java:662)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash.visitAll(org/jruby/RubyHash.java:647)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash.iteratorVisitAll(org/jruby/RubyHash.java:1319)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash.each_pairCommon(org/jruby/RubyHash.java:1354)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash.each(org/jruby/RubyHash.java:1343)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyHash$INVOKER$i$0$0$each.call(org/jruby/RubyHash$INVOKER$i$0$0$each.gen)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.invokeOther14:each(usr/share/logstash/logstash_minus_core/lib/logstash//usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:536)
Nov 27 02:16:59 logstash3 logstash: at usr.share.logstash.logstash_minus_core.lib.logstash.pipeline.output_batch(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:536)
Nov 27 02:16:59 logstash3 logstash: at RUBY.worker_loop(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:481)
Nov 27 02:16:59 logstash3 logstash: at RUBY.block in start_workers(/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:439)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyProc.call(org/jruby/RubyProc.java:289)
Nov 27 02:16:59 logstash3 logstash: at org.jruby.RubyProc.call(org/jruby/RubyProc.java:246)
Nov 27 02:16:59 logstash3 logstash: at java.lang.Thread.run(java/lang/Thread.java:748)

This was repeated a number of times.

Corresponding logs in logstash-plain.log are:

[2017-11-27T02:16:03,354][INFO ][logstash.filters.translate] refreshing dictionary file
[2017-11-27T02:16:43,637][INFO ][logstash.filters.translate] refreshing dictionary file
[2017-11-27T10:04:14,134][WARN ][logstash.runner          ] SIGTERM received. Shutting down the agent.
[2017-11-27T10:04:19,204][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>19, "stalling_thread_info"=>{"other"=>[{"thread_id"=>58, "name"=>"[main]<beats", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-beats-5.0.2-java/lib/logstash/inputs/beats.rb:196:in `run'"}, {"thread_id"=>59, "name"=>"[main]<beats", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-beats-5.0.2-java/lib/logstash/inputs/beats.rb:196:in `run'"}, {"thread_id"=>60, "name"=>"[main]<beats", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-beats-5.0.2-java/lib/logstash/inputs/beats.rb:196:in `run'"}, {"thread_id"=>61, "name"=>"[main]<beats", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-beats-5.0.2-java/lib/logstash/inputs/beats.rb:196:in `run'"}]}}
[2017-11-27T10:04:19,206][ERROR][logstash.shutdownwatcher ] The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.

From 02:16 Logstash was "frozen" and remained so until the service was restarted at 10:04. It then failed to start (due to corrupted persistent queues), and I had to empty the queue to get it started again.

So it seems that when NTP steps the time backwards, the metrics filter errors out; if this happens in all workers, Logstash freezes. In addition, whatever mechanism causes the metrics filter to freeze a worker somehow leaves the persistent queue in an unstable state, which then prevents Logstash from restarting normally.
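My reading of the backtraces (a rough sketch only, not the actual Logstash code) is that a duration derived from wall-clock time becomes negative after the backwards clock step, and the internal LongCounter refuses negative increments:

# Sketch of the suspected failure mode, assuming durations are computed
# from wall-clock timestamps (this is not the real Logstash source).
started_at = Time.now.utc.to_f            # work starts
# ... ntpd steps the clock back by ~0.18 s (see clock_step above) ...
elapsed = Time.now.utc.to_f - started_at  # can now be negative

def increment(value)
  # mirrors the guard in org.logstash.instrument.metrics.counter.LongCounter
  raise "Counters can not be incremented by negative values" if value < 0
end

increment((elapsed * 1000).to_i)  # raises inside the worker thread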

My metrics filter is:

filter {
        metrics {
                meter => [ "events" ]
                timer => { "lag" => "%{[@metadata][lag]}" }
                add_tag => [ "metric" ]
                flush_interval => 60
                clear_interval => 300
                rates => [ 1 ]
        }
}

The lag is calculated using:

ruby {
        code => "event.set('[@metadata][lag]', Time.now.utc.to_f - event.get('@timestamp').to_f)"
}
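One mitigation I considered (untested, only a sketch) is clamping the lag at zero so the timer never receives a negative value:

ruby {
        # Untested sketch: never let the lag go negative, even if the wall
        # clock has stepped backwards since the event's @timestamp.
        code => "lag = Time.now.utc.to_f - event.get('@timestamp').to_f
                 event.set('[@metadata][lag]', lag < 0 ? 0.0 : lag)"
}

However, the backtraces above point at the internal filter_delegator/output_delegator metric counters rather than my lag timer, so clamping alone may not be enough.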

To work around this issue I have disabled the metrics filter.
