Lots of SendRequestTransportExceptions caused by NodeNotConnectedExceptions


(Dave) #1

I have a 2-node Elasticsearch cluster that's been running OK for a couple of days. This evening, one of my nodes suddenly became unreachable, and my cluster health now says:

# curl 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "campfire.production.local",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3761,
  "active_shards" : 3761,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 3757,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}

If I check out the logs of the unreachable node, I see tens of thousands of stack traces like this:

[2015-10-03 03:07:37,848][DEBUG][action.admin.indices.stats] [node-1d56ac14-2323-411d-b543-462408c202b3] [events-default@2014.06.29][1], node[CFCnEnSKRwGlsf6-N2ECOA], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@340388b3]
org.elasticsearch.transport.SendRequestTransportException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]][indices:monitor/stats[s]]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
    at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
    at ...
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
    ... 55 more

(172.31.7.164 is the node that is still reachable.) The unreachable node logs this hundreds of times per second. Is this a recognizable failure mode? How should I go about getting my cluster back online?


(Mark Walkom) #2

Is that node up and running? Can you ping or telnet to it from the node that is still part of the cluster?
It looks like a networking issue stopped it from seeing the other node, so I'd try restarting the node-1d56ac14-2323-411d-b543-462408c202b3 node and see whether it rejoins the other one.
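
If telnet isn't installed on the box, the same reachability check can be sketched in a few lines of Python. This is only a rough illustration: the helper name `can_connect` and the 3-second timeout are assumptions, and a successful TCP connect only shows that the port is open, not that the Elasticsearch transport layer is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds.

    Hypothetical helper: it only tests that the port accepts connections,
    which is roughly what `telnet host port` would tell you.
    """
    try:
        # create_connection resolves the host and applies the timeout
        # to the connect attempt; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from the unreachable node, `can_connect("172.31.7.164", 9300)` would test the transport address that appears in the stack trace. A `False` result would point toward a network or firewall problem rather than something inside Elasticsearch.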

Also, you have way too many shards on this cluster. You really need to reduce the shard count or you will be wasting a lot of resources.
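
To put a number on the shard count, here is a small sketch that works from the cluster health figures posted above. The function name `shard_summary` and the `expected_nodes` parameter are made up for illustration; the input values are copied from the health output in post #1.

```python
import json

# Relevant fields copied from the _cluster/health output in post #1.
HEALTH_JSON = """
{
  "number_of_nodes": 1,
  "active_shards": 3761,
  "initializing_shards": 0,
  "unassigned_shards": 3757
}
"""


def shard_summary(health: dict, expected_nodes: int) -> dict:
    """Summarize the shard load (hypothetical helper).

    Total shards are taken as active + initializing + unassigned.
    expected_nodes is the node count once the cluster is whole again,
    not the count currently reported by the health API.
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "total_shards": total,
        "shards_per_node": total / expected_nodes,
    }


summary = shard_summary(json.loads(HEALTH_JSON), expected_nodes=2)
print(summary)
```

With both nodes back, that is 7518 shards in total, or about 3759 per node.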

