I have a two-node Elasticsearch cluster that had been running fine for a couple of days. This evening one of the nodes suddenly became unreachable, and cluster health now reports:
# curl localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "campfire.production.local",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 3761,
  "active_shards" : 3761,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 3757,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0
}
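Every active shard is a primary (active_shards equals active_primary_shards), so I assume the 3,757 unassigned shards are mostly the replica copies that were allocated on the missing node. For reference, the surviving node will list them via the cat API (the grep is just to filter):

# curl -s localhost:9200/_cat/shards | grep UNASSIGNED | head
# curl -s 'localhost:9200/_cat/nodes?v'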
Looking at the logs of the unreachable node, I see tens of thousands of stack traces like this one:
[2015-10-03 03:07:37,848][DEBUG][action.admin.indices.stats] [node-1d56ac14-2323-411d-b543-462408c202b3] [events-default@2014.06.29][1], node[CFCnEnSKRwGlsf6-N2ECOA], [P], s[STARTED]: failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@340388b3]
org.elasticsearch.transport.SendRequestTransportException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]][indices:monitor/stats[s]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:249)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction.performOperation(TransportBroadcastOperationAction.java:182)
at ...
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node-4cdfe7b4-3504-405d-8a93-c40519f329f4][inet[/172.31.7.164:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:964)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:656)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
... 55 more
(172.31.7.164 is the node that is still reachable.) It logs this hundreds of times per second. Is this a recognizable failure mode, and how should I go about getting the cluster back online?
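In case it's relevant, this is the kind of basic connectivity check I can run from the broken node toward 172.31.7.164 (9300 is the transport port from the stack trace above, 9200 the HTTP port):

# nc -zv 172.31.7.164 9300
# curl -s '172.31.7.164:9200/_cluster/health?pretty'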