Hi all.
We have had an issue at a customer where nodes are dropping out of the
cluster.
I've managed to recreate a similar scenario in our test environment by
using tcpkill to kill the TCP traffic between a node and the master node.
In this setup, there are 3 nodes.
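For reference, this is roughly the tcpkill command used to cut the traffic
(the interface name eth0 is an assumption about our test setup; the filter
expression is what matters):

  # run on a machine that sees the traffic; resets every matching TCP segment
  tcpkill -i eth0 -9 host 192.168.4.1 and host 192.168.4.2 and port 9300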
When using tcpkill to kill the TCP connection between MASTER and NODE2, we get
this behaviour:
[tcpkill]
MASTER NODE:
----- Discovery log -----
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] [node ]
[[loadtest-appserver2][R5jh4N7OTmykaFRHUx70AA][inet[/192.168.4.2:9300]]{local=false}]
transport disconnected (with verified connect)
----- Network log -----
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] close connection exception caught on
transport layer [[id: 0xeb1569e7, /192.168.4.2:50232 => /192.168.4.1:9300]],
disconnecting from relevant node
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0xeb1569e7, /
192.168.4.2:50232 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] close connection exception caught on
transport layer [[id: 0x90606514, /192.168.4.1:33969 => /192.168.4.2:9300]],
disconnecting from relevant node
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:109)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:312)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at
org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at
org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x4ca1672c, /
192.168.4.2:50223 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0xa6bd9ab0, /
192.168.4.2:50226 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0xe9c0cc49, /
192.168.4.2:50225 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x246a84b0, /
192.168.4.2:50230 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x9d07abe3, /
192.168.4.2:50231 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x82aea2d8, /
192.168.4.2:50229 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x3ea2f9d0, /
192.168.4.2:50228 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x922efe9c, /
192.168.4.2:50227 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] channel closed: [id: 0x5ccc53ad, /
192.168.4.2:50224 => /192.168.4.1:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] disconnected from
[[loadtest-appserver2][R5jh4N7OTmykaFRHUx70AA][inet[/192.168.4.2:9300]]{local=false}],
channel closed event
NODE 2:
----- Discovery log -----
FINE: [loadtest-appserver2] [master] stopping fault detection against
master [[loadtest-appserver1][8xIvXqkSQuCzAt4lyTpJRQ][inet[/192.168.4.1:9300]]{local=false}],
reason [master failure, transport disconnected (with verified connect)]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalInfo
INFO: [loadtest-appserver2] master_left
[[loadtest-appserver1][8xIvXqkSQuCzAt4lyTpJRQ][inet[/192.168.4.1:9300]]{local=false}],
reason [transport disconnected (with verified connect)]
----- Network log -----
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] disconnected from
[[loadtest-appserver1][8xIvXqkSQuCzAt4lyTpJRQ][inet[/192.168.4.1:9300]]{local=false}],
channel closed event
NODE 3:
----- Discovery log -----
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver3] [master] restarting fault detection against
master [[loadtest-appserver2][R5jh4N7OTmykaFRHUx70AA][inet[/192.168.4.2:9300]]{local=false}],
reason [new cluster state received and we are monitoring the wrong master
[[loadtest-appserver1][8xIvXqkSQuCzAt4lyTpJRQ][inet[/192.168.4.1:9300]]{local=false}]]
----- Network log -----
FINEST: [loadtest-appserver3] channel opened: [id: 0x659ca590, /
192.168.4.3:45096 => /192.168.4.3:9300]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver3] disconnected from
[[loadtest-appserver2][R5jh4N7OTmykaFRHUx70AA][inet[/192.168.4.2:9300
]]{local=false}]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver3] connected to node
[[loadtest-appserver2][R5jh4N7OTmykaFRHUx70AA][inet[/192.168.4.2:9300
]]{local=false}]
Sep 24, 2013 10:32:28 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver3] disconnected from
[[loadtest-appserver1][8xIvXqkSQuCzAt4lyTpJRQ][inet[/192.168.4.1:9300
]]{local=false}]
[/tcpkill]
When instead just taking down the network interface of NODE2, this is the behaviour:
[ifdown interface]
MASTER NODE:
----- Discovery log -----
FINEST: [loadtest-appserver1] [node ] failed to ping
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
retry [1] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/fd/ping]
request_id [230] timed out after [30000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:351)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Sep 24, 2013 10:46:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] [node ] failed to ping
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
retry [2] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/fd/ping]
request_id [267] timed out after [30000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:351)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Sep 24, 2013 10:47:24 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] [node ] failed to ping
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
retry [3] out of [3]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/fd/ping]
request_id [300] timed out after [30000ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:351)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Sep 24, 2013 10:47:24 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] [node ] failed to ping
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
tried [3] times, each with maximum [30s] timeout
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] failed to send cluster state to
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
should be detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/publish]
disconnected
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] failed to send cluster state to
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
should be detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/publish]
disconnected
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] failed to send cluster state to
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300]]{local=false}],
should be detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[loadtest-appserver2][inet[/192.168.4.2:9300]][discovery/zen/publish]
disconnected
----- Network log -----
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver1] disconnected from
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300
]]{local=false}]
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalTrace
FINEST: [loadtest-appserver1] close connection exception caught on
transport layer [[id: 0x1a51ae87, /192.168.4.1:34050 :> /192.168.4.2:9300]],
disconnecting from relevant node
java.nio.channels.ClosedChannelException
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:406)
at
org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:374)
at
org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:58)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:574)
at
org.elasticsearch.common.netty.channel.Channels.close(Channels.java:812)
at
org.elasticsearch.common.netty.channel.AbstractChannel.close(AbstractChannel.java:197)
at
org.elasticsearch.transport.netty.NettyTransport$NodeChannels.closeChannelsAndWait(NettyTransport.java:902)
at
org.elasticsearch.transport.netty.NettyTransport$NodeChannels.close(NettyTransport.java:891)
at
org.elasticsearch.transport.netty.NettyTransport.disconnectFromNode(NettyTransport.java:759)
at
org.elasticsearch.transport.TransportService.disconnectFromNode(TransportService.java:137)
at
org.elasticsearch.cluster.service.InternalClusterService$UpdateTask$1.run(InternalClusterService.java:406)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
NODE 2:
----- Discovery log -----
FINE: [loadtest-appserver2] [master] failed to ping
[[loadtest-appserver1][vtpBrETrTESCcia7XHM_qA][inet[/192.168.4.1:9300]]{local=false}],
tried [3] times, each with maximum [30s] timeout
Sep 24, 2013 10:47:24 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] [master] stopping fault detection against
master [[loadtest-appserver1][vtpBrETrTESCcia7XHM_qA][inet[/192.168.4.1:9300]]{local=false}],
reason [master failure, failed to ping, tried [3] times, each with maximum
[30s] timeout]
Sep 24, 2013 10:47:24 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalInfo
INFO: [loadtest-appserver2] master_left
[[loadtest-appserver1][vtpBrETrTESCcia7XHM_qA][inet[/192.168.4.1:9300]]{local=false}],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
Sep 24, 2013 10:47:24 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] [master] restarting fault detection against
master [[loadtest-appserver3][0ljeiF-RRZy65dhxjJ22DQ][inet[/192.168.4.3:9300]]{local=false}],
reason [possible elected master since master left (reason = failed to ping,
tried [3] times, each with maximum [30s] timeout)]
Sep 24, 2013 10:48:55 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] [master] failed to ping
[[loadtest-appserver3][0ljeiF-RRZy65dhxjJ22DQ][inet[/192.168.4.3:9300]]{local=false}],
tried [3] times, each with maximum [30s] timeout
Sep 24, 2013 10:48:55 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] [master] stopping fault detection against
master [[loadtest-appserver3][0ljeiF-RRZy65dhxjJ22DQ][inet[/192.168.4.3:9300]]{local=false}],
reason [master failure, failed to ping, tried [3] times, each with maximum
[30s] timeout]
Sep 24, 2013 10:48:55 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalInfo
INFO: [loadtest-appserver2] master_left
[[loadtest-appserver3][0ljeiF-RRZy65dhxjJ22DQ][inet[/192.168.4.3:9300]]{local=false}],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
----- Network log -----
Sep 24, 2013 10:48:55 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver2] disconnected from
[[loadtest-appserver3][0ljeiF-RRZy65dhxjJ22DQ][inet[/192.168.4.3:9300
]]{local=false}]
NODE 3:
----- Discovery log -----
Nothing relevant
----- Network log -----
Sep 24, 2013 10:48:54 AM
org.elasticsearch.common.logging.slf4j.Slf4jESLogger internalDebug
FINE: [loadtest-appserver3] disconnected from
[[loadtest-appserver2][WzI53XwyQF21NrbrCy76CQ][inet[/192.168.4.2:9300
]]{local=false}]
[/ifdown interface]
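For reference, scenario 2 was produced by simply taking the interface down on
NODE2, along the lines of (again, the interface name is an assumption):

  ifdown eth0        # or equivalently: ip link set eth0 down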
Scenario 2, taking down the network interface, is handled well: the node
retries the ping a number of times, each with a set timeout, before giving up.
Scenario 1, on the other hand, is not handled very gracefully: the master is
immediately dropped as master and a new master is elected (this could probably
be mitigated with the minimum_master_nodes setting, but that is not an
acceptable solution in our scenario). So a "normal" network disconnect is
handled fine, while killing the TCP channel causes a lot of trouble. Is there
any way to handle this more gracefully, and is this the intended behaviour?
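For reference, these are the zen discovery settings we have been looking at in
elasticsearch.yml; the retry/timeout values shown are the defaults matching the
[3] times / [30s] in the logs above, and the minimum_master_nodes value is just
the quorum for a 3-node cluster:

  discovery.zen.minimum_master_nodes: 2
  discovery.zen.fd.ping_timeout: 30s
  discovery.zen.fd.ping_retries: 3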
While finding the root cause of the TCP channel disconnects at our customer is
our top priority, the consequences of a disconnect like the one tcpkill
produces look like a critical weakness in the solution.
regards
Runar Myklebust
Enonic AS
Senior Developer