Our cluster crashed when a node could not be reached

chenjinyuan87 · September 24, 2015, 2:51pm

All we know so far is that one of our machines may have a network problem.
When that happens, all the data nodes in the Elasticsearch cluster start trying to reconnect to it endlessly.
The logs look like this:
[2015-09-23 17:48:01,930][WARN ][discovery.zen.ping.multicast] [bak4] failed to connect to requesting node [Ox][ma0VLbs6TfGWrTdSbpfeBQ][TJ-app1][inet[/192.168.2.167:9300]]{client=true, data=false}
...
[2015-09-23 17:48:01,930][WARN ][transport.netty ] [bak4] exception caught on transport layer [[id: 0x95434756]], closing connection
java.net.NoRouteToHostException: No route to host
...

We also found thousands of threads waiting at:
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:64)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:649)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:630)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:149)
...
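
For anyone who wants to put a number on "thousands of threads", here is a rough sketch that counts how many threads in a saved thread dump contain a given frame. It assumes a jstack-style text dump in which threads are separated by blank lines. The class name `BlockedFrameCounter` and the method `countThreadsWithFrame` are made up for this illustration and are not part of Elasticsearch or the JDK.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Elasticsearch or the JDK.
class BlockedFrameCounter {

    // Splits a thread dump into per-thread blocks (assumed to be separated
    // by blank lines) and counts the blocks that mention the given frame.
    static int countThreadsWithFrame(String dump, String frameFragment) {
        int count = 0;
        for (String block : dump.split("\\R\\s*\\R")) {
            if (block.contains(frameFragment)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("usage: java BlockedFrameCounter.java <thread-dump-file>");
            return;
        }
        // Read the whole dump file into memory; fine for dumps of modest size.
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(countThreadsWithFrame(dump, "KeyedLock.acquire"));
    }
}
```

Running it against dumps taken a few minutes apart should show whether the number of threads parked in `KeyedLock.acquire` keeps growing.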

In the end, all the data nodes crashed with an OOM error because no more threads could be created.
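
To illustrate the pattern the stack trace suggests, here is a small standalone sketch of threads piling up behind a per-node lock. It is not Elasticsearch's code: `KeyedLockPileUp`, `lockFor` and `simulate` are hypothetical names, and the assumption is simply that one caller holds the lock for a node while its connection attempt hangs and every other caller blocks in `lock()` with no timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; not taken from the Elasticsearch source.
class KeyedLockPileUp {
    // One lock per node id, so connection attempts to the same node are serialized.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(String nodeId) {
        return LOCKS.computeIfAbsent(nodeId, id -> new ReentrantLock());
    }

    // The calling thread plays the one caller stuck in a slow connect.
    // Returns how many other callers were parked on the same lock meanwhile.
    static int simulate(String nodeId, int callers) {
        ReentrantLock lock = lockFor(nodeId);
        List<Thread> waiters = new ArrayList<>();
        int queued;
        lock.lock();
        try {
            for (int i = 0; i < callers; i++) {
                // Each new connection attempt blocks in lock() with no timeout.
                Thread t = new Thread(() -> { lock.lock(); lock.unlock(); });
                t.setDaemon(true);
                t.start();
                waiters.add(t);
            }
            // Wait until every caller is queued behind the lock holder.
            while (lock.getQueueLength() < callers) {
                LockSupport.parkNanos(1_000_000L);
            }
            queued = lock.getQueueLength();
        } finally {
            lock.unlock();
        }
        for (Thread t : waiters) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return queued;
    }

    public static void main(String[] args) {
        System.out.println("queued callers: " + simulate("unreachable-node", 50));
    }
}
```

Every queued caller costs one thread, so if new attempts arrive faster than the holder gives up, the thread count only goes up. Whether this is what actually happens inside 1.3.4 is a guess based on the trace above.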

Our Elasticsearch version is 1.3.4.
Is this a known bug, or did we do something wrong that caused this problem?
