Elasticsearch nodes are going down across multiple instances

Hi -

We have a problem with Elasticsearch nodes going down one by one, around the same time of day, on clusters hosted in different VPCs in AWS. I could not figure out a pattern. I see these errors in the log files; I'd appreciate any help troubleshooting the issue. We have ruled out network issues.

```
[2020-08-21T05:00:05,798][WARN ][r.suppressed ] [IP ADDRESS] path: /assets/_search, params: {typed_keys=true, ignore_unavailable=false, expand_wildcards=open, allow_no_indices=true, ignore_throttled=true, index=assets, search_type=query_then_fetch, batched_reduce_size=512, ccs_minimize_roundtrips=true}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:314) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:139) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:210) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:76) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:177) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:424) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1120) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:243) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:241) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:233) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:136) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:663) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.4.2.jar:7.4.2]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:328) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:302) [netty-codec-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.38.Final.jar:4.1.38.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty
```

Welcome to our community! :smiley:

That error is not related to a node going down. Is there more to the log?

Hi Mark -

Thank you so much for your response. I am new to Elasticsearch, so I'll take one environment as an example: it has 4 nodes, and each node went down within a few hours of the previous one. Here is the last log content of the node that went down first.

```
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: org.elasticsearch.ElasticsearchException$1: Task cancelled before it started: by user request
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:639) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:137) [elasticsearch-7.4.2.jar:7.4.2]
... 40 more
Caused by: java.lang.IllegalStateException: Task cancelled before it started: by user request
at org.elasticsearch.tasks.TaskManager.registerCancellableTask(TaskManager.java:141) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.tasks.TaskManager.register(TaskManager.java:122) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:60) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendLocalRequest(TransportService.java:746) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.access$100(TransportService.java:74) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService$3.sendRequest(TransportService.java:127) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:692) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:602) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:646) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:637) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:137) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:54) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:259) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.performPhaseOnShard(InitialSearchPhase.java:294) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:114) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273) ~[elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.4.2.jar:7.4.2]
... 35 more
[2020-08-21T05:00:05,807][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Removing ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3368093] on the node [_tATxjpzRdW9a9RPg4mCWg]
[2020-08-21T05:55:57,126][WARN ][o.e.c.InternalClusterInfoService] [$IP_ADDRESS] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2020-08-21T05:55:57,126][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [$IP_ADDRESS] failed to execute on node [v8dQUnFsSZ6ZDoDNYY3nHA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.44.226.164][10.44.226.164:9300][cluster:monitor/nodes/stats[n]] request_id [4397853] timed out after [36554ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.2.jar:7.4.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.2.jar:7.4.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-21T05:55:57,126][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [$IP_ADDRESS] collector [cluster_stats] timed out when collecting data
[2020-08-21T05:55:57,726][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Received ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3400713] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]
[2020-08-21T05:55:58,315][DEBUG][o.e.a.a.c.n.t.c.TransportCancelTasksAction] [$IP_ADDRESS] Received ban for the parent [5FKJq3EdSqazPIIXm5Dd0Q:3180189] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]
[2020-08-21T05:55:58,739][INFO ][o.e.x.m.p.NativeController] [$IP_ADDRESS] Native controller process has stopped - no new native processes can be started
```

I see some transport remote exceptions in the server JSON logs just before the service died. Does that imply anything?

{"type": "server", "timestamp": "2020-08-21T05:00:05,807Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Removing ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3368093] on the node [_tATxjpzRdW9a9RPg4mCWg]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "WARN", "component": "o.e.c.InternalClusterInfoService", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Failed to update node information for ClusterInfoUpdateJob within 15s timeout", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "ERROR", "component": "o.e.x.m.c.c.ClusterStatsCollector", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "collector [cluster_stats] timed out when collecting data", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:57,126Z", "level": "DEBUG", "component": "o.e.a.a.c.n.s.TransportNodesStatsAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "failed to execute on node [v8dQUnFsSZ6ZDoDNYY3nHA]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" ,
"stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [10.44.226.164][10.44.226.164:9300][cluster:monitor/nodes/stats[n]] request_id [4397853] timed out after [36554ms]",
"at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.2.jar:7.4.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.2.jar:7.4.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:830) [?:?]"] }
{"type": "server", "timestamp": "2020-08-21T05:55:57,726Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Received ban for the parent [fLAA2wY9QRmXWI3Hblky4A:3400713] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:58,315Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Received ban for the parent [5FKJq3EdSqazPIIXm5Dd0Q:3180189] on the node [_tATxjpzRdW9a9RPg4mCWg], reason: [by user request]", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }
{"type": "server", "timestamp": "2020-08-21T05:55:58,739Z", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "everestsrches-prod-x", "node.name": "IP ADDRESS", "message": "Native controller process has stopped - no new native processes can be started", "cluster.uuid": "V2cgLcOnQyKPK0iTpsYScQ", "node.id": "_tATxjpzRdW9a9RPg4mCWg" }

What's the output from `_cat/allocation?v`?

Here is the pattern: let's say in a 3-node cluster one node goes down first; if it was the master, one of the 2 surviving nodes is elected as master. After a few hours the 2nd node goes down, and then the 3rd node goes down as well, since at least 2 nodes are required to elect a master.
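The cascade described above is just majority voting: with N master-eligible nodes, an election needs floor(N/2) + 1 of them reachable, so the last surviving node of a 3-node cluster can never elect itself. A quick sketch of the arithmetic:

```shell
# Minimum number of master-eligible nodes that must be reachable
# for a master election: a strict majority, floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # a 3-node cluster survives the loss of one node
quorum 2   # a 2-node cluster cannot survive the loss of either node
```

This is also why two master-eligible nodes are no more resilient than one: the quorum for both is 2.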

When one node goes down, I see shards being reassigned to the surviving nodes. Would any kind of additional logging help capture more information? We see a lot of remote exception errors; do they cause any problems? Are there any network-related parameters we should look into?

```
curl IPADDRESS1:9200/_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
    65        9.1gb    18.5gb     30.6gb     49.1gb           37 IP   IP IP
    65        8.9gb    16.9gb     32.2gb     49.1gb           34 IP   IP IP
```

Please format your code/logs/config using the </> button, or markdown-style backticks. It makes things easier to read, which helps us help you :slight_smile:

The logs you have provided don't really provide much detail on what may be causing this. Is there anything else prior to that?
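If nothing more shows up at the default level, one option is to raise logger verbosity dynamically through the cluster settings API. A sketch (host is a placeholder; set the value back to `null` afterwards, since transport logging at DEBUG is very noisy):

```shell
# Temporarily raise transport-layer logging on all nodes
curl -s -XPUT "http://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.transport": "DEBUG"}}'
```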