Elasticsearch nodes are going down across multiple instances

Hi -

We have a problem with Elasticsearch nodes going down one by one, around the same time of day, on clusters hosted in different VPCs in AWS. I could not figure out a pattern. I see these errors in the log files; I'd appreciate any help troubleshooting the issue. We have ruled out network issues.

```
[2020-08-21T05:00:05,798][WARN ][r.suppressed ] [IP ADDRESS] path: /assets/_search, params: {typed_keys=true, ignore_unavailable=false, expand_wildcards=open, allow_no_indices=true, ignore_throttled=true, index=assets, search_type=query_then_fetch, batched_reduce_size=512, ccs_minimize_roundtrips=true}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:314) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:210) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:76) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:177) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:424) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:243) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:241) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:233) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:136) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty
```

Welcome to our community! :smiley:

That error is not related to a node going down. Is there more to the log?

Hi Mark -

Thank you so much for your response. I am new to Elasticsearch, so I'll take one environment as an example: it has 4 nodes, and each node went down within a few hours of the previous one. Here is the last log content of the node that went down first.

```
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: org.elasticsearch.ElasticsearchException$1: Task cancelled before it started: by user request
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137) [elasticsearch-7.4.2.jar:7.4.2]
... 40 more
Caused by: java.lang.IllegalStateException: Task cancelled before it started: by user request
at org.elasticsearch.tasks.TaskManager.registerCancellableTask(TaskManager.java:141) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.tasks.TaskManager.register(TaskManager.java:122) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:60) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:746) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.access$100(TransportService.java:74) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService$3.sendRequest(TransportService.java:127) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:692) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:602) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:646) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:637) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:137) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:54) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:259) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.performPhaseOnShard(InitialSearchPhase.java:294) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:114) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.4.2.jar:7.4.2]
... 35 more
[2020-08-21T05:00:05,807][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Removing ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3368093] on the node [_tATxjpzRdW9a9RPg4mCWg]
[2020-08-21T05:55:57,126][WARN ][o.e.c.InternalClusterInfoService] [$IP_ADDRESS] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2020-08-21T05:55:57,126][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [$IP_ADDRESS] failed to execute on node [v8dQUnFsSZ6ZDoDNYY3nHA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.44.226.164][10.44.226.164:9300][cluster:monitor/nodes/stats[n]] request_id [4397853] timed out after [36554ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.2.jar:7.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-21T05:55:57,126][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [$IP_ADDRESS] collector [cluster_stats] timed out when collecting data
[2020-08-21T05:55:57,726][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Received ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3400713] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]
[2020-08-21T05:55:58,315][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Received ban for the parent [5FKJq3EdSqazPIIXm5Dd0Q:3180189] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]
[2020-08-21T05:55:58,739][INFO ][o.e.x.m.p.NativeController] [$IP_ADDRESS] Native controller process has stopped - no new native processes can be started
```

I see some transport remote exceptions in the server JSON logs just before the service died. Does that imply anything?

{"type": "server", "timestamp": "2020-08-21T05:00:05,807Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Removing ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3368093] on the node [_tATxjpzRdW9a9RPg4mCWg]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Failed to update node information for ClusterInfoUpdateJob within 15s timeout", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "collector [cluster_stats] timed out when collecting data", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "DEBUG", "component": "o.e.a.a.c.n.s.TransportNodesStatsAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "failed to execute on node [v8dQUnFsSZ6ZDoDNYY3nHA]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" ,
"stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.44.226.164][10.44.226.164:9300][cluster:monitor/nodes/stats[n]] request_id [4397853] timed out after [36554ms]",
"at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.2.jar:7.4.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]"] }
{"type": "server", "timestamp": "2020-08-21T05:55:57,726Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Received ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3400713] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:58,315Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Received ban for the parent [5FKJq3EdSqazPIIXm5Dd0Q:3180189] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:58,739Z", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Native controller process has stopped - no new native processes can be started", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }

What's the output from `_cat/allocation?v`?

Here is the pattern: let's say in a 3-node cluster one node goes down first; if it was the master, one of the 2 surviving nodes is elected as master. After a few hours the 2nd node goes down, and then the 3rd node goes down as well, since at least 2 nodes are required to elect a master.
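The cascade described above is just majority voting: with N master-eligible nodes, an election needs floor(N/2) + 1 of them reachable, so the last surviving node of a 3-node cluster can never elect itself. A quick sketch of the arithmetic:

```shell
# Minimum number of master-eligible nodes that must be reachable
# for a master election: a strict majority, floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # a 3-node cluster survives the loss of one node
quorum 2   # a 2-node cluster cannot survive the loss of either node
```

This is also why two master-eligible nodes are no more resilient than one: the quorum for both is 2.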

When one node goes down, I see shards being reassigned to the surviving nodes. Would any kind of additional logging help capture more information? We see a lot of remote exception errors; do they cause any problems? Are there any network-related parameters we should look into?

```
curl IPADDRESS1:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
    65        9.1gb    18.5gb     30.6gb     49.1gb           37 IP   IP IP
    65        8.9gb    16.9gb     32.2gb     49.1gb           34 IP   IP IP
```

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

The logs you have provided don't really provide much detail on what may be causing this. Is there anything else prior to that?
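If nothing more shows up at the default level, one option is to raise logger verbosity dynamically through the cluster settings API. A sketch (host is a placeholder; set the value back to `null` afterwards, since transport logging at DEBUG is very noisy):

```shell
# Temporarily raise transport-layer logging on all nodes
curl -s -XPUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.transport": "DEBUG"}}'
```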