Hi there.
I am using Elasticsearch 1.3.1 in a 3-node setup to power the search functionality on a website (in fact, 45 websites). Each ES instance runs on JDK 1.8.0_161 via the Elasticsearch 64-bit Windows service wrapper.
I have two 64 GB machines with a heap size of 16 GB (min and max) and one 8 GB machine with a heap size of 4 GB (min and max).
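If it helps, the heap is pinned through the service wrapper's JVM settings; as far as I remember it is the equivalent of setting this before installing the service (or the same values in the service manager GUI), so treat the exact mechanism as approximate:

    rem on the 64 GB nodes (the 8 GB node uses 4g instead)
    set ES_HEAP_SIZE=16g
    rem which ends up as -Xms16g -Xmx16g on the java command line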
I have the set of IIS sites (45 of them) running on the same box (not ideal, I know; they will be moved off soon).
I am using Marvel for analysis.
What happens?
We see strong performance and stability for hours at a time, and then some event occurs which, within minutes, sends the cluster into a total frenzy and ultimately crashes it (I have to do a hard restart of the ES service and an iisreset to recover).
Some metrics from Marvel:
You can see where it took off and crashed - zooming in:
Digging into the search thread queue, it just goes haywire:
JVM stats:
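To get raw numbers behind the thread-pool graph, I am planning to sample the search pool with the cat API while the cluster is healthy and again during a spike; something like this should show the queue depth and rejection counts directly (the column list is from memory, so it may need tweaking):

    curl -s "http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected"

If the Marvel graph is anything to go by, the queue column should shoot up and rejections start appearing right before the crash, which lines up with the exception below.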
One of many, many exceptions in the log:
[2018-03-02 16:13:45,165][DEBUG][action.search.type ] [Isis] [4885391] Failed to execute fetch phase
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@1cb1a4dc
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:407)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:107)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.moveToSecondPhase(TransportSearchQueryThenFetchAction.java:102)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.innerMoveToSecondPhase(TransportSearchTypeAction.java:404)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:198)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:171)
at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:219)
at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:210)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:158)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
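As far as I can tell, the "queue capacity 1000" in that exception is just the default search thread pool queue size; I have not overridden anything, so in elasticsearch.yml terms the nodes are effectively running with:

    # ES 1.x defaults - not explicitly set on my nodes, shown for context only
    threadpool.search.type: fixed
    threadpool.search.queue_size: 1000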
Has anyone got any good advice for troubleshooting this? In particular, is there any way I can see the exact requests that were taking so long to process that they overloaded the queue at that time?
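One thing I am considering is enabling the index search slow log so that the offending queries end up in the node logs; something along these lines, with thresholds that are pure guesses on my part:

    curl -XPUT "http://localhost:9200/_all/_settings" -d '{
      "index.search.slowlog.threshold.query.warn": "5s",
      "index.search.slowlog.threshold.query.info": "2s",
      "index.search.slowlog.threshold.fetch.warn": "1s"
    }'

I also plan to grab GET /_nodes/hot_threads the next time the queue starts climbing, to see what the search threads are actually busy doing. Any other suggestions would be much appreciated.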