Node not connected

Hi all,

This morning, my two-node cluster was in a bad state.

The first server was initially the master; after a long GC pause it seemed to
have a network problem. The second server became the master after three ping
timeouts.

While ES nicely avoided a split-brain, I was hoping the first server would
reconnect on its own after some time, but that never happened.
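
For reference, the second server's "failed to ping, tried [3] times, each with
maximum [30s] timeout" matches what I believe are the zen fault-detection
defaults. A minimal elasticsearch.yml sketch of the settings involved (names as
of 0.90.x, as far as I know; the values shown are the defaults, for reference
only):

# Zen fault detection: how long a silent master is tolerated before it is dropped.
# Values are the defaults as far as I know, shown for reference, not a recommendation.
discovery.zen.fd.ping_interval: 1s   # how often the master is pinged
discovery.zen.fd.ping_timeout: 30s   # how long each ping may wait before counting as a failure
discovery.zen.fd.ping_retries: 3     # failed pings tolerated before the master is declared gone

# The usual two-node trade-off: 2 prevents split-brain but the cluster cannot
# elect a master with one node down; 1 (the default, I believe) lets each side
# elect itself.
discovery.zen.minimum_master_nodes: 1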

I finally restarted Elasticsearch on the first server. The version is 0.90.3.

The logs follow.

First server

[2013-10-07 06:42:55,851][WARN ][monitor.jvm ] [sissor1]
[gc][ConcurrentMarkSweep][1772297][6041] duration [1.5m], collections
[1]/[1.5m], total [1.5m]/[2.3h], memory [91gb]->[77.9gb]/[127.8gb],
all_pools {[Code Cache] [19.4mb]->[19.4mb]/[48mb]}{[Par Eden Space]
[825mb]->[13.6mb]/[865.3mb]}{[Par Survivor Space]
[108.1mb]->[0b]/[108.1mb]}{[CMS Old Gen] [90.1gb]->[77.9gb]/[126.9gb]}{[CMS
Perm Gen] [40.4mb]->[40.3mb]/[166mb]}
[2013-10-07 06:42:55,851][WARN ][transport.netty ] [sissor1]
exception caught on transport layer [[id: 0x7a39b48a, /192.168.110.90:42621
=> /192.168.110.80:9300]], closing connection
java.io.IOException: Relais brisé (pipe)   [French JVM locale: "Broken pipe"]
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89)
    at sun.nio.ch.IOUtil.write(IOUtil.java:46)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
    at org.elasticsearch.common.netty.channel.socket.nio.SocketSendBufferPool$UnpooledSendBuffer.transferTo(SocketSendBufferPool.java:203)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:202)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:147)
    at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:99)
    at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
    at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:704)
    at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:671)
    at org.elasticsearch.common.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
    at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:88)
    at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:62)
    at org.elasticsearch.discovery.zen.fd.MasterFaultDetection$MasterPingRequestHandler.messageReceived(MasterFaultDetection.java:387)
    at org.elasticsearch.discovery.zen.fd.MasterFaultDetection$MasterPingRequestHandler.messageReceived(MasterFaultDetection.java:362)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:211)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:108)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Second server

[2013-10-07 06:42:53,372][INFO ][discovery.zen ] [sissor2]
master_left
[[sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]]], reason
[failed to ping, tried [3] times, each with maximum [30s] timeout]
[2013-10-07 06:42:53,380][INFO ][cluster.service ] [sissor2]
master {new [sissor2][tJqVPKnITWq0TgqAyxP8Yg][inet[/192.168.110.90:9300]],
previous [sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]]},
removed {[sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]],},
reason: zen-disco-master_failed
([sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]])
[2013-10-07 06:45:33,357][WARN ][discovery.zen ] [sissor2]
received cluster state from
[[sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]]] which is
also master but with an older cluster_state, telling
[[sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]]] to rejoin
the cluster
[2013-10-07 06:45:33,368][WARN ][discovery.zen ] [sissor2]
failed to send rejoin request to
[[sissor1][JhOUEniPT6aFTQVAt0cNMg][inet[/192.168.110.80:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [sissor1][inet[/192.168.110.80:9300]][discovery/zen/rejoin]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:203)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:172)
    at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:545)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:285)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:143)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [sissor1][inet[/192.168.110.80:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:806)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:520)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:188)
    ... 7 more

Regards

Benoît

Hello again,

No replies so far.

Maybe my questions were not clear, so let me rephrase them:

  • Is it possible that the long GC pause is what caused the node to be dropped
    from the cluster? (Some back-of-the-envelope numbers below.)
  • Is it normal that the node never reconnected on its own?
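
For the first question, the numbers at least seem to line up: with the default
fault-detection settings the master is declared dead after roughly
ping_retries × ping_timeout = 3 × 30s ≈ 90s of silence, and the GC line on the
first server shows a ConcurrentMarkSweep collection of 1.5m (≈ 90s), during
which the node presumably answers nothing. If that reading is correct, one
possible mitigation (illustrative values only, I have not tested them) would be
to let fault detection ride out longer pauses in elasticsearch.yml:

# Illustrative only: a window large enough to survive a ~90s stop-the-world collection.
# The obvious downside is that a genuinely dead master is also detected more slowly.
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 3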

Regards.

Benoît


Hi Benoît.

  • Is it possible that the long GC pause is what caused the node to be dropped from the cluster?

Yes - I've certainly seen this in my own cluster.

  • Is it normal that the node never reconnected on its own?

Once a node has left the cluster, I don't recall seeing it rejoin without being restarted. Other people, more experienced than I am, may be able to give you more information.
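
For what it's worth (I honestly can't say whether it would have changed the
rejoin behaviour you saw), two-box setups are often pinned to unicast discovery
so that each node always knows exactly where to look for the other. A sketch
for elasticsearch.yml, with the addresses taken from your logs:

# Unicast discovery between the two known nodes (addresses from the logs above).
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.110.80:9300", "192.168.110.90:9300"]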

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

Thank you Dan for your feedback.

I really need to look into these GC problems.

Regards.

Benoît
