Sudden spike then total failure

mpax · March 2, 2018, 5:26pm

Hi there.

I am using ES 1.3.1 in a 3-node setup to power search functionality on a website (in fact, 45 websites). Each instance of ES runs on JDK 1.8_161 via the Elasticsearch 64-bit Windows service wrapper.

I have 2 x 64 GB machines with a heap size of 16 GB min and max and one 8 GB machine with a heap size of 4 GB min and max.

I have a set of IIS sites (45 of them) running on the same box (not ideal I know, it will be moved off soon)

I am using Marvel for analysis

What happens?

Performance and stability are strong for hours at a time, and then some event occurs that, within minutes, sends the cluster into a total frenzy and ultimately crashes it (I have to do a hard reset of the service / iisreset).

Some metrics from Marvel:

You can see where it took off and crashed - zooming in:

Digging into the search thread queue, it just goes haywire

JVM stats

One of many, many exceptions in the log

[2018-03-02 16:13:45,165][DEBUG][action.search.type       ] [Isis] [4885391] Failed to execute fetch phase
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@1cb1a4dc
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at org.elasticsearch.search.action.SearchServiceTransportAction.execute(SearchServiceTransportAction.java:509)
	at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteFetch(SearchServiceTransportAction.java:407)
	at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.executeFetch(TransportSearchQueryThenFetchAction.java:107)
	at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.moveToSecondPhase(TransportSearchQueryThenFetchAction.java:102)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.innerMoveToSecondPhase(TransportSearchTypeAction.java:404)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:198)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:174)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:171)
	at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:219)
	at org.elasticsearch.search.action.SearchServiceTransportAction$6.handleResponse(SearchServiceTransportAction.java:210)
	at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:158)
	at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Has anyone got any good advice for troubleshooting this one? Is there any way I can see the exact requests that were taking so long to process that they overloaded the queue at that time?
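A note for anyone reading along: before hunting for individual requests, it can help to confirm which node is rejecting search tasks. The sketch below assumes the node stats endpoint (`/_nodes/stats/thread_pool`) is reachable and returns per-node `thread_pool.search` counters with `queue` and `rejected` fields. The host URL and the sample numbers are hypothetical, and `fetch_node_stats` is only defined here, not called.

```python
import json
import urllib.request


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch thread pool stats from the cluster (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/_nodes/stats/thread_pool") as resp:
        return json.loads(resp.read().decode("utf-8"))


def search_pool_summary(node_stats):
    """Map each node name to (queued, rejected) for its search thread pool."""
    summary = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        search = node.get("thread_pool", {}).get("search", {})
        name = node.get("name", node_id)
        summary[name] = (search.get("queue", 0), search.get("rejected", 0))
    return summary


# Hypothetical response fragment; the node IDs and numbers are made up.
sample = {
    "nodes": {
        "abc123": {
            "name": "Isis",
            "thread_pool": {"search": {"queue": 1000, "rejected": 4321}},
        },
        "def456": {
            "name": "OtherNode",
            "thread_pool": {"search": {"queue": 0, "rejected": 0}},
        },
    }
}

for name, (queued, rejected) in sorted(search_pool_summary(sample).items()):
    print(name, "queued:", queued, "rejected:", rejected)
```

A `rejected` count that climbs on only one node would point at that node (or the shards it holds) rather than at the cluster as a whole.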

dadoonet · March 2, 2018, 5:53pm

Few things:

I have a set of IIS sites (45 of them) running on the same box (not ideal I know, it will be moved off soon)

Yeah. That's a bad idea.

I have 2 x 64 GB machines with a heap size of 16 GB min and max and one 8 GB machine with a heap size of 4 GB min and max.

That's also a bad idea. You should have consistent nodes. If you don't really need this other node, my suggestion is to keep it as a master-only node.

I am using ES 1.3.1

This version is no longer supported. It's better to upgrade to Elasticsearch 6.2.2.

mpax · March 2, 2018, 6:02pm

I can change the other node to master-only; the rest of the work (upgrade, etc.) is sadly non-trivial.
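As a rough sketch of what the master-only change involves: on this generation of Elasticsearch, a node's role is set with `node.master` and `node.data` in `elasticsearch.yml`, and the node has to be restarted to pick them up. The helper names below are hypothetical; the code only renders the two settings lines.

```python
def master_only_settings():
    """Settings for a node that can be elected master but holds no data."""
    return {"node.master": True, "node.data": False}


def to_yaml_lines(settings):
    """Render flat settings as 'key: value' lines for elasticsearch.yml."""
    lines = []
    for key, value in settings.items():
        # YAML booleans are lowercase, unlike Python's True/False.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(key + ": " + rendered)
    return "\n".join(lines)


print(to_yaml_lines(master_only_settings()))
```

With the small machine holding no shards, search and fetch work would land only on the two 64 GB nodes.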

I'll enable slow query logging with a low threshold to see which queries take a long time and cause the queue to reject requests.
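A minimal sketch of that slow log change, assuming the per-index `index.search.slowlog.threshold.*` settings are updated dynamically through the index `_settings` endpoint. The host, the index name `my_index`, and the threshold values are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request


def slowlog_settings(query_warn="1s", query_info="500ms", fetch_warn="500ms"):
    """Build search slow log thresholds (the values are illustrative only)."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        # The exception in this thread came from the fetch phase,
        # so a fetch threshold is set as well.
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }


def build_settings_request(base_url, index, settings):
    """Prepare (but do not send) a PUT to the index settings endpoint."""
    body = json.dumps(settings).encode("utf-8")
    url = base_url + "/" + index + "/_settings"
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "application/json")
    return request


# Hypothetical host and index name.
req = build_settings_request("http://localhost:9200", "my_index", slowlog_settings())
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would apply the settings. Slow log entries are recorded per shard, so the times they show are shard-level execution times rather than the end-to-end latency of a request.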

