We had to restart the elasticsearch service on one of our 7.6.0 data nodes today because it was under high CPU load. Could this be a general bug that bit us?
Top of the only stack dump/issue seen in the log:
[2020-02-21T09:15:34,251][DEBUG][o.e.a.s.TransportSearchScrollAction] [d1r2n14] [1416773] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [d1r1n12][62.243.41.176:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1416773]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:462) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.search.SearchService.runAsync(SearchService.java:347) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:474) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$10(SearchTransportService.java:356) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$15(AuthorizationService.java:344) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.6.0.jar:7.6.0]
at java.util.ArrayList.forEach(ArrayList.java:1507) ~[?:?]
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:127) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:62) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:40) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
at org.elasticsearch.xpack.security.authz.interceptor.UpdateRequestInterceptor.intercept(UpdateRequestInterceptor.java:23) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:339) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.6.0.jar:7.6.0]
at java.util.ArrayList.forEach(ArrayList.java:1507) ~[?:?]
at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:127) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:62) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:40) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
at org.elasticsearch.xpack.security.authz.interceptor.SearchRequestInterceptor.intercept(SearchRequestInterceptor.java:19) ~[?:?]
at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:339) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.6.0.jar:7.6.0]
at java.util.ArrayList.forEach(ArrayList.java:1507) ~[?:?]
...
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:503) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:442) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:281) [netty-codec-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:600) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:554) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-02-21T09:41:15,987][INFO ][o.e.n.Node ] [d1r2n14] stopping ...
[2020-02-21T09:41:16,096][INFO ][o.e.x.w.WatcherService ] [d1r2n14] stopping watch service, reason [shutdown initiated]
[2020-02-21T09:41:16,097][INFO ][o.e.x.w.WatcherLifeCycleService] [d1r2n14] watcher has stopped and shutdown
[2020-02-21T09:41:16,241][INFO ][o.e.c.c.Coordinator ] [d1r2n14] master node [{es-mst1}{90yeCMLeTb66cCVzAsz3aA}{KWa-QIHXQEmogR5ff_dNVw}{es-mst1}{62.243.41.230:9300}{lm}{ml.machine_memory=6251069440, rack=OPA3.3.16, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [es-mst1][62.243.41.230:9300][disconnected] disconnected
[2020-02-21T09:41:16,357][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [d1r2n14] [controller/22056] [Main.cc@150] Ml controller exiting
[2020-02-21T09:41:16,357][INFO ][o.e.x.m.p.NativeController] [d1r2n14] Native controller process has stopped - no new native processes can be started
[2020-02-21T09:41:21,805][INFO ][o.e.n.Node ] [d1r2n14] stopped
[2020-02-21T09:41:21,805][INFO ][o.e.n.Node ] [d1r2n14] closing ...
[2020-02-21T09:41:21,840][INFO ][o.e.n.Node ] [d1r2n14] closed
The nodes hot threads API is normally a good way to investigate high CPU, especially if it's not GC. It sounds like you're no longer seeing this problem, but if it recurs then GET _nodes/hot_threads?threads=99999 will show you what Elasticsearch is busy doing.
The result of the hot threads output is plain text, not JSON, so jq will be of no help to you.
(-H 'Content-Type: application/json' tells Elasticsearch that you are sending JSON in your request, but does not impose any restrictions on the format of the response)
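For reference, a minimal way to capture a sample per node into a file looks roughly like this (host and output path are placeholders, and you may need to add credentials if security is enabled):

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=99999' -o /tmp/es_hot_threads.txt

The 500ms windows you'll see in the output are the API's default sampling interval; it can be changed with the interval parameter if you want longer samples.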
Okay, I managed to trigger the issue again, and this time I grabbed a few hot thread samples. Not surprisingly it's search threads, as it's triggered by attempting to filter a search in the SIEM app:
# grep 'cpu usage' /tmp/es_hot_threads*
/tmp/es_hot_threads_d1r1n11.txt: 4.3% (21.7ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n11][write][T#16]'
/tmp/es_hot_threads_d1r1n11.txt: 4.3% (21.5ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n11][write][T#14]'
/tmp/es_hot_threads_d1r1n12.txt: 100.7% (503.3ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n12][search][T#25]'
/tmp/es_hot_threads_d1r1n12.txt: 100.7% (503.3ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n12][search][T#5]'
/tmp/es_hot_threads_d1r1n12.txt: 100.6% (502.8ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n12][search][T#1]'
/tmp/es_hot_threads_d1r1n14.txt: 5.7% (28.5ms out of 500ms) cpu usage by thread 'elasticsearch[d1r1n14][transport_worker][T#26]'
/tmp/es_hot_threads.txt: 100.4% (501.9ms out of 500ms) cpu usage by thread 'elasticsearch[d1r2n14][search][T#16]'
/tmp/es_hot_threads.txt: 25.9% (129.6ms out of 500ms) cpu usage by thread 'elasticsearch[d1r2n14][[winlogbeat-7.6.0-2020.02.24][2]: Lucene Merge Thread #2468]'
This isn't exactly my area of expertise, but I see mention of FieldData in that stack trace, which suggests you're trying to sort or aggregate on a text field; that can be very expensive and is often the wrong thing to do. Do you know what the offending query is and where it's coming from?
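To make the distinction concrete (index and field names below are only examples, not taken from your cluster): a terms aggregation on a text field forces fielddata to be built on the heap, whereas the same aggregation on a keyword field uses doc values and is far cheaper.

# expensive: terms aggregation on a text field (needs fielddata=true in the mapping)
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/winlogbeat-*/_search' -d '{"size":0,"aggs":{"top_msgs":{"terms":{"field":"message"}}}}'
# cheaper: same aggregation on a keyword field backed by doc values
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/winlogbeat-*/_search' -d '{"size":0,"aggs":{"hosts":{"terms":{"field":"host.name"}}}}'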
I've been troubleshooting a similar issue myself.
Previously I was running a 7.5.1 cluster with no issues; however, after upgrading to 7.6.0, CPU usage seems to have increased dramatically on the worker nodes.
Sure thing @cpmoore. When you have more details it's probably best to open your own thread on the topic rather than adding to this one, since it might be something completely different.
Found a way to trigger the high CPU: if I enter a regexp search in SIEM before turning off KQL (like: host.name: /some-prefix.+/), then the CPUs on the cluster data nodes spin in userland and usage reaches for the sky for quite some time.
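For reference, on the data nodes a filter like that presumably ends up as something along the lines of a regexp query (index and field here are just illustrative, matching the example above):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/winlogbeat-*/_search' -d '{"query":{"regexp":{"host.name":"some-prefix.+"}}}'

Regexp queries have to walk the terms dictionary of the field, so they can get expensive on high-cardinality fields.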
Hm or maybe it's not...
Maybe it's just a question of SOC users' impatience: clicking too fast through different SIEM views and thus generating too many outstanding queries running at the same time.
I've had a quick chat with some colleagues who know more about this than I, and they don't think this search corresponds with the stack trace you shared above. The hot thread looks like it's running a terms aggregation with execution_hint: map, but that's not the case with this search. Do you see any other searches using execution_hint?
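For anyone wanting to grep their saved searches, visualizations or Watcher definitions, a terms aggregation with that hint looks roughly like this (index and field names are illustrative):

curl -s -H 'Content-Type: application/json' 'http://localhost:9200/winlogbeat-*/_search' -d '{"size":0,"aggs":{"hosts":{"terms":{"field":"host.name","execution_hint":"map"}}}}'

With execution_hint set to map the buckets are built from the term values directly instead of via global ordinals, which can be considerably heavier when many documents match.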