Sudden spike then total failure

Hi there.

I am using ES 1.3.1 in a 3-node setup to power search functionality for a website (in fact 45 websites). Each ES instance runs on JDK 1.8.0_161 via the Elasticsearch 64-bit Windows service wrapper.

I have two 64 GB machines with a heap size of 16 GB (min and max) and one 8 GB machine with a heap size of 4 GB (min and max).
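For what it's worth, on the 1.x Windows service wrapper I believe the heap has to be set before the service is installed - something roughly like this, assuming the wrapper honours ES_HEAP_SIZE at install time (if not, the same values can be set through the procrun GUI via bin\service.bat manager):

    rem Hypothetical example - run from the Elasticsearch home directory
    set ES_HEAP_SIZE=16g
    bin\service.bat remove
    bin\service.bat install
    bin\service.bat start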

I have a set of IIS sites (45 of them) running on the same box (not ideal, I know; they will be moved off soon).

I am using Marvel for analysis

What happens?

I see strong performance and stability for hours at a time, and then some event occurs which, within minutes, sends the cluster into a total frenzy and ultimately crashes it (I have to do a hard restart of the ES service and an iisreset).

Some metrics from Marvel:

You can see where it took off and crashed - zooming in:

Digging into the search thread queue, it just goes haywire.
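For reference, the search pool's queue and rejected counters can also be watched live from the API while this is happening - something like this should work on 1.3 (column names assumed from the 1.x _cat API):

    curl -s "http://localhost:9200/_cat/thread_pool?v&h=host,search.active,search.queue,search.rejected"
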

JVM stats

One of many, many exceptions in the log:

[2018-03-02 16:13:45,165][DEBUG][action.search.type       ] [Isis] [4885391] Failed to execute fetch phase
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@1cb1a4dc
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
	at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:407)
	at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:107)
	at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.moveToSecondPhase(TransportSearchQueryThenFetchAction.java:102)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.innerMoveToSecondPhase(TransportSearchTypeAction.java:404)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:198)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:174)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:171)
	at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:219)
	at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:210)
	at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:158)
	at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
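
As I understand it, this rejection means the search thread pool's queue (capacity 1000) was full, so new query/fetch tasks were simply being dropped. A hot-threads dump taken while the queue is climbing should show what the search threads are actually busy with - a sketch, assuming the 1.x nodes hot_threads API:

    curl -s "http://localhost:9200/_nodes/hot_threads?threads=10&type=cpu"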

Has anyone got any good advice for troubleshooting this one - is there any way I can see the exact requests that were taking so long to process that they overloaded the queue at that time?

A few things:

I have a set of IIS sites (45 of them) running on the same box (not ideal, I know; they will be moved off soon).

Yeah. That's a bad idea.

I have two 64 GB machines with a heap size of 16 GB (min and max) and one 8 GB machine with a heap size of 4 GB (min and max).

That's also a bad idea. You should have consistent nodes. If you don't really need this other node, my suggestion is to keep it as a master-only node.
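On 1.x that should just be a couple of settings in elasticsearch.yml on that node - roughly:

    node.master: true
    node.data: false

With node.data set to false it holds no shards, so the smaller heap is much less of a problem.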

I am using ES 1.3.1

This version is not supported anymore. You would be better off upgrading to Elasticsearch 6.2.2.

I can change the other node to master-only; the rest of the work (the upgrade, etc.) is non-trivial, sadly.

I'll enable slow query logging with a low threshold to see what queries take a long time and cause the queue to reject.
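Something along these lines is what I have in mind - on 1.x the slowlog thresholds are dynamic index settings, so they can be pushed to all indices without a restart (the thresholds here are just example values):

    curl -XPUT "http://localhost:9200/_settings" -d '{
      "index.search.slowlog.threshold.query.warn": "2s",
      "index.search.slowlog.threshold.query.info": "500ms",
      "index.search.slowlog.threshold.fetch.warn": "1s"
    }'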

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.