Hello,
I am doing a bit of work to harden our Elasticsearch cluster configuration.
We've had issues with split-brain scenarios in the past. We have clusters
with 3 or 5 nodes where any node can handle index and search operations.
We have used the minimum_master_nodes setting to try to alleviate the
issue, but it did not work for us; we still got split brains.
We are using ES 1.1.0 with the latest ZooKeeper plugin. I have a test
cluster of 3 nodes running on m3.large instances w/ 500 GB EBS volumes. I
have tested a few scenarios so far which have performed as expected:
- Communication between a node and ZK going down. After a short timeout
  (~30 seconds) the node is eliminated from the cluster.
- The sudden death of a node (master or otherwise) via 'kill -9'.
  Rebalancing and election worked out very well here.
- Stopping a node cleanly. Nothing odd here, it works every time and ZK
  makes cluster state updates really fast.
- Adding new nodes. Again, quick cluster state updates via ZK.
The last scenario I am interested in is network partitions. In this case I
am trying to sever the communication between two of the nodes and a third.
I have been using iptables to DROP all inbound and outbound traffic between
one of the 3 nodes in the test cluster and the other 2. I basically add four
iptables entries on the node I want to cut off.
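
A rough sketch of what those four entries might look like, assuming one
INPUT and one OUTPUT DROP rule per peer. The exact rules used are not
shown in this post, so the rule layout is an assumption, and the peer
addresses and the helper name partition_rules are placeholders for
illustration only:

```python
def partition_rules(peer_ips):
    """Build iptables commands that drop all traffic to and from each peer.

    Assumption: one INPUT rule (match on source) and one OUTPUT rule
    (match on destination) per peer, so two peers yield four rules.
    """
    rules = []
    for ip in peer_ips:
        # Drop everything arriving from the peer.
        rules.append(["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])
        # Drop everything sent to the peer.
        rules.append(["iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP"])
    return rules


if __name__ == "__main__":
    # Placeholder addresses for the two other nodes in the test cluster.
    for rule in partition_rules(["192.0.2.11", "192.0.2.12"]):
        print(" ".join(rule))
```

The script only prints the commands; they would still have to be run as
root on the node being isolated, and removed again (the same rules with
-D instead of -A) to heal the partition.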
After doing so it takes a very long time for the node to finally be evicted
from the cluster state. During this time a number of API methods stop
working, including /_stats and /_nodes, and searches also time out on the
node whose communication was severed. The good news is no split brains; the
bad news is that eviction of the bad node takes a looooong time.
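
To put a number on "a looooong time", one option is to poll a surviving
node and time how long it takes for the node count to drop. The following
is only a sketch: it assumes the /_cluster/health response carries a
number_of_nodes field, and the function names, base URL and timeouts are
hypothetical choices, not anything taken from the setup above.

```python
import json
import time
import urllib.request


def fetch_node_count(base_url, timeout=5.0):
    """Return number_of_nodes as reported by GET /_cluster/health."""
    url = base_url + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["number_of_nodes"]


def time_until_evicted(get_count, expected, poll_interval=1.0,
                       max_wait=1800.0, clock=time.monotonic,
                       sleep=time.sleep):
    """Poll get_count() until it reports at most `expected` nodes.

    Returns the elapsed seconds, or None if max_wait passes first.
    Failed or timed-out requests are treated as "not evicted yet".
    """
    start = clock()
    while clock() - start < max_wait:
        try:
            if get_count() <= expected:
                return clock() - start
        except OSError:
            # URLError and socket timeouts are OSError subclasses.
            pass
        sleep(poll_interval)
    return None


if __name__ == "__main__":
    # Hypothetical address of a node on the majority side of the partition.
    base = "http://localhost:9200"
    print(time_until_evicted(lambda: fetch_node_count(base), expected=2))
```

Start it right after the iptables rules go in; with a 3-node cluster the
expected count after eviction is 2.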
Any help with explaining what is going on or how I can better test this
sort of scenario is much appreciated.
cheers,
Rob
The following is a bunch of the info on the exceptions I see when things
start to time out finally. Until the node is removed from cluster state
everything works kinda wonky.
The exception on a node trying to talk to the iptable'd node looks like
this:
[2014-04-28 21:15:36,488][WARN ][cluster.service ] [Kukulcan] failed to reconnect to node [Harold "Happy" Hogan][_xgiPJYmSuecN0--yDBmlg][zookeeper-test-builders-us-west-1-i-c9f6b095][inet[/10.168.250.15:9300]]{availabilityzone=us-west-1b}
org.elasticsearch.transport.ConnectTransportException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:773)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:702)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:670)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:129)
    at org.elasticsearch.cluster.service.InternalClusterService$ReconnectToNodes.run(InternalClusterService.java:515)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.common.netty.channel.ConnectTimeoutException: connection timed out: /10.168.250.15:9300
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:137)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
I get these messages on the iptable'd node:
[2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][0], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.28][0], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,831][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [_river][0], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,832][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@7c82439e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.25][0], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.25][3], node[Cin_0uRIQwubm585lpkYnQ], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,833][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.27][2], node[Cin_0uRIQwubm585lpkYnQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@5e30581e]
org.elasticsearch.transport.NodeDisconnectedException: [Herr Kleiser][inet[/10.176.41.55:9300]][indices/stats/s] disconnected
[2014-04-28 21:08:37,834][DEBUG][action.admin.indices.stats] [Harold "Happy" Hogan] [events-2014.04.26][1], node[Cin_0uRIQwubm585lpkYnQ], relocating [_xgiPJYmSuecN0--yDBmlg], [R], s[RELOCATING]: Failed to execute [org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@bc812]
and on one of the nodes trying to talk with the iptable'd node:
[2014-04-28 21:09:29,346][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,351][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,350][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:09:29,349][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.NodeDisconnectedException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n] disconnected
[2014-04-28 21:12:26,195][DEBUG][action.admin.cluster.node.info] [Herr Kleiser] failed to execute on node [_xgiPJYmSuecN0--yDBmlg]
org.elasticsearch.transport.SendRequestTransportException: [Harold "Happy" Hogan][inet[/10.168.250.15:9300]][cluster/nodes/info/n]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:202)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:170)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:102)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:73)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.client.node.NodeClusterAdminClient.execute(NodeClusterAdminClient.java:72)
    at org.elasticsearch.client.support.AbstractClusterAdminClient.nodesInfo(AbstractClusterAdminClient.java:183)
    at org.elasticsearch.rest.action.admin.cluster.node.info.RestNodesInfoAction.handleRequest(RestNodesInfoAction.java:105)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:159)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:142)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:291)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:43)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)