I have 3 Logstash 5.3.2 instances configured to fetch events from several Kafka topics. They also receive events from Filebeat.
Sometimes one or more of the Logstash instances lock up. They stop processing events and stop sending heartbeat events, but they do not die. I can still query the API, heap usage looks fine, and there is nothing useful in the Logstash logs. Everything looks normal, except that no events are processed.
Using the monitoring API (hot threads), I can see that all the workers are doing something like this:
"hot_threads" : {
  "time" : "2017-07-05T15:09:11+00:00",
  "busiest_threads" : 3,
  "threads" : [ {
    "name" : "[main]>worker12",
    "percent_of_cpu_time" : 62.31,
    "state" : "runnable",
    "traces" : [
      "org.apache.commons.collections4.map.AbstractHashedMap.ensureCapacity(AbstractHashedMap.java:648)",
      "org.apache.commons.collections4.map.AbstractHashedMap.checkCapacity(AbstractHashedMap.java:614)",
      "org.apache.commons.collections4.map.AbstractHashedMap.addMapping(AbstractHashedMap.java:519)",
      "org.apache.commons.collections4.map.LRUMap.addMapping(LRUMap.java:351)",
      "org.apache.commons.collections4.map.AbstractHashedMap.put(AbstractHashedMap.java:288)",
      "org.logstash.uaparser.CachingParser.parse(CachingParser.java:84)",
      "sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)",
      "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)",
      "java.lang.reflect.Method.invoke(Method.java:498)",
      "org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:451)"
    ]
  }, {
    "name" : "[main]>worker11",
    "percent_of_cpu_time" : 62.28,
    "state" : "runnable",
    "traces" : [
      "org.apache.commons.collections4.map.AbstractHashedMap.getEntry(AbstractHashedMap.java:461)",
      "org.apache.commons.collections4.map.AbstractLinkedMap.getEntry(AbstractLinkedMap.java:206)",
      "org.apache.commons.collections4.map.LRUMap.get(LRUMap.java:244)",
      "org.apache.commons.collections4.map.LRUMap.get(LRUMap.java:227)",
      "org.logstash.uaparser.CachingParser.parse(CachingParser.java:79)",
      "sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)",
      "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)",
      "java.lang.reflect.Method.invoke(Method.java:498)",
      "org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:451)",
      "org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:312)"
    ]
  }, {
    "name" : "[main]>worker7",
    "percent_of_cpu_time" : 62.22,
    "state" : "runnable",
    "traces" : [
      "org.apache.commons.collections4.map.AbstractHashedMap.getEntry(AbstractHashedMap.java:461)",
      "org.apache.commons.collections4.map.AbstractLinkedMap.getEntry(AbstractLinkedMap.java:206)",
      "org.apache.commons.collections4.map.LRUMap.get(LRUMap.java:244)",
      "org.apache.commons.collections4.map.LRUMap.get(LRUMap.java:227)",
      "org.logstash.uaparser.CachingParser.parse(CachingParser.java:79)",
      "sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)",
      "sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)",
      "java.lang.reflect.Method.invoke(Method.java:498)",
      "org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:451)",
      "org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:312)"
    ]
  } ]
}
I have been getting some errors from the useragent filter for several weeks. On some days these messages show up a lot and Logstash does not lock up, but maybe it is related:
[2017-07-05T13:39:42,903][ERROR][logstash.filters.useragent] Uknown error while parsing user agent data {:exception=>java.lang.IllegalStateException: Entry.next=null, data[removeIndex]=null previous=null key=Mozilla/5.0 (Linux; Android 6.0.1; SM-N920P Build/MMB29K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/59.0.3071.125 Mobile Safari/537.36 value={"user_agent": {"family": "Chrome Mobile", "major": "59", "minor": "0", "patch": "3071"}, "os": {"family": "Android", "major": "6", "minor": "0", "patch": "1", "patch_minor": ""}, "device": Samsung SM-N920P} size=100000 maxSize=100000 Please check that your keys are immutable, and that you have used synchronization properly. If so, then please report this to dev@commons.apache.org as a bug., :field=>"agent", :event=>2017-07-05T13:39:35.000Z nginx-aws-b02 nginx-lb-access [05/Jul/2017:13:39:35 +0000] - https www.example.com "GET /journee HTTP/1.1" 200 59273 0.026 0.026 127.0.0.1:1081 200 "https://www.example.com" "Mozilla/5.0 (Linux; Android 6.0.1; SM-N920P Build/MMB29K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/59.0.3071.125 Mobile Safari/537.36" "-" - OK}
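The IllegalStateException above is raised by Apache Commons Collections' LRUMap, and its text explicitly asks whether synchronization was used properly. Together with the hot threads spinning inside AbstractHashedMap.getEntry and ensureCapacity, this suggests (but does not prove) that several pipeline workers share one LRU cache in CachingParser without a lock. A hash chain corrupted by concurrent writers can make a lookup loop forever, which would look like a worker that is "runnable" yet never finishes an event. Note that in an LRU map even get() is a write, because it moves the entry to the most-recently-used position. The Java sketch below is not the Logstash code: SynchronizedLruCache is a hypothetical class, built on java.util.LinkedHashMap rather than LRUMap, that shows one way to make such a cache safe to share by routing every access through a single lock.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical class for illustration; not part of Logstash or the useragent filter.
class SynchronizedLruCache<K, V> {
    private final Map<K, V> map;

    SynchronizedLruCache(final int maxSize) {
        // accessOrder = true: get() moves the entry to the most-recently-used end,
        // so reads also modify the map's internal links.
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least-recently-used entry once the cap is exceeded.
                return size() > maxSize;
            }
        };
        // One lock guards every get/put, so two workers can never rewire
        // the same bucket chain or linked list at the same time.
        this.map = Collections.synchronizedMap(lru);
    }

    V get(K key) { return map.get(key); }

    void put(K key, V value) { map.put(key, value); }

    int size() { return map.size(); }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedLruCache<String, String> cache = new SynchronizedLruCache<>(1000);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Simulate 8 pipeline workers hammering the shared cache with 5000 distinct keys.
        for (int t = 0; t < 8; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++) {
                    String key = "ua-" + ((i * 31 + id) % 5000);
                    if (cache.get(key) == null) {
                        cache.put(key, "parsed:" + key);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("size=" + cache.size()); // prints "size=1000"
    }
}
```

A single lock serializes cache access, which costs some throughput under many workers but rules out this class of corruption. Caches built on ConcurrentHashMap, with approximate rather than strict LRU eviction, are a common alternative when contention matters.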