Our cluster crashed when a node could not be reached


(chenjinyuan87) #1

The only thing we know so far is that one of our machines may have a network problem.
When that happens, all the data nodes in the Elasticsearch cluster start trying to reconnect to it endlessly.
The logs look like this:
[2015-09-23 17:48:01,930][WARN ][discovery.zen.ping.multicast] [bak4] failed to connect to requesting node [Ox][ma0VLbs6TfGWrTdSbpfeBQ][TJ-app1][inet[/192.168.2.167:9300]]{client=true, data=false}
...
[2015-09-23 17:48:01,930][WARN ][transport.netty ] [bak4] exception caught on transport layer [[id: 0x95434756]], closing connection
java.net.NoRouteToHostException: No route to host
...

We also found thousands of threads waiting here:
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at org.elasticsearch.common.util.concurrent.KeyedLock.acquire(KeyedLock.java:64)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:649)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:630)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:149)
...

Eventually, all the data nodes crashed with an OOM error because no more threads could be created.
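
To make the pattern in the stack trace concrete, here is a minimal Java sketch of the kind of pile-up it suggests. It is only an illustration and not Elasticsearch code: the class `KeyedLockPileUp`, the method `countBlockedWaiters`, and the latch standing in for a connect attempt that does not return are all hypothetical names invented for this example. The only detail taken from the trace is that callers block in `ReentrantLock.lock()` on a lock that is keyed per node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class KeyedLockPileUp {
    // One lock per node id: a simplified, hypothetical stand-in for a keyed lock.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Starts `callers` threads that all want the same node's lock and reports how many end up queued. */
    static int countBlockedWaiters(String nodeId, int callers) {
        try {
            return run(nodeId, callers);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static int run(String nodeId, int callers) throws InterruptedException {
        ReentrantLock lock = LOCKS.computeIfAbsent(nodeId, k -> new ReentrantLock());
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // The holder takes the lock and then stalls, like a connect attempt that is stuck (assumption).
        Thread holder = new Thread(() -> {
            lock.lock();
            try {
                holderHasLock.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();
        holderHasLock.await();

        // Every further caller for the same node needs its own thread and parks inside lock().
        List<Thread> waiters = new ArrayList<>();
        for (int i = 0; i < callers; i++) {
            Thread t = new Thread(() -> {
                lock.lock();
                lock.unlock();
            });
            t.start();
            waiters.add(t);
        }

        // Wait until all callers are queued on the lock, then record the count.
        while (lock.getQueueLength() < callers) {
            Thread.sleep(10);
        }
        int blocked = lock.getQueueLength();

        // Clean up: let the holder finish so the waiters can drain.
        release.countDown();
        holder.join();
        for (Thread t : waiters) {
            t.join();
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("threads parked on the lock: " + countBlockedWaiters("TJ-app1", 200));
    }
}
```

In this sketch the number of parked threads grows with the number of callers for as long as the holder is stuck. If something similar happens on each reconnect attempt, that would be consistent with the thread count growing until no more threads can be created, but we have not confirmed that this is the actual cause.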

Our Elasticsearch version is 1.3.4.
Is this a known bug, or have we done something wrong that caused this problem?

