Lots of SendRequestTransportExceptions caused by NodeNotConnectedExceptions

dave · October 3, 2015, 3:53am

I have a 2-node Elasticsearch cluster that's been running OK for a couple days. Suddenly this evening, one of my nodes became unreachable, and my cluster health says:

# curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "campfire.production.local",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3761,
  "active_shards" : 3761,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 3757,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

If I check out the logs of the unreachable node, I see tens of thousands of stack traces like this:

[2015-10-03 03:07:37,848][DEBUG][action.admin.indices.stats] [node-1d56ac14-2323-411d-b543-462408c202b3] [events-default@2014.06.29][1], node[CFCnEnSKRwGlsf6-N2ECOA], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@340388b3]
org.elasticsearch.transport.SendRequestTransportException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]][indices:monitor/stats[s]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
    at ...
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
    ... 55 more

(172.31.7.164 is the node that is still reachable.) The unreachable node logs this hundreds of times per second. Is this a recognizable mode of failure? How should I go about getting my cluster back online?

warkolm · October 3, 2015, 11:42pm

Is that node up and running? Can you ping/telnet to it from the node that is still part of the cluster?
It looks like a networking issue stopped it from seeing the other node, so I'd try restarting the node-1d56ac14-2323-411d-b543-462408c202b3 node and see if it rejoins the other one.
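If telnet isn't installed on the box, the same reachability check can be sketched in a few lines of Python. This is only a rough sketch: the function name `can_connect` is made up for illustration, and the address and port are taken from the stack trace above (9300 is the transport port shown there). Run it on the host that is logging the exceptions.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: the address and transport port from the log lines above.
    host, port = "172.31.7.164", 9300
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

A successful connect only shows that the port is open from that host; it doesn't prove the node will rejoin the cluster, so a restart may still be needed.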

Also, you have way too many shards on this cluster. You really need to reduce the count of these or you will be wasting a lot of resources.
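To put a number on that, here is a rough Python sketch that totals the shard counts from the `_cluster/health` output posted above. The helper name `summarize_shards` is hypothetical, and it assumes the total is active + initializing + unassigned shards, and that the cluster normally has 2 data nodes as described in the first post.

```python
def summarize_shards(health: dict, expected_data_nodes: int) -> dict:
    """Total the shard counts from a _cluster/health response."""
    # Assumption: these three counters together cover every shard copy.
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "total_shards": total,
        # Average load per node once every expected data node is back.
        "shards_per_node": total / expected_data_nodes,
    }


if __name__ == "__main__":
    # Values copied from the cluster health output in the first post.
    health = {
        "active_shards": 3761,
        "initializing_shards": 0,
        "unassigned_shards": 3757,
    }
    print(summarize_shards(health, expected_data_nodes=2))
```

With the posted numbers that comes to 7518 shard copies across two nodes.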
