Discovery_zen disconnect issues

Hi All, I am seeing some weirdness in cluster connectivity across multiple
environments. I have two environments (qa and production), both set up
identically. Each environment consists of 4 VMs (ny01, ny02, nj01, nj02)
with two instances of elasticsearch running on each, forming two separate
clusters. One cluster runs elasticsearch 0.18.7 on the standard ports
(http: 9200 / tcp: 9300); it has been up for about a year and is not
experiencing any problems. Call this one cluster A. I am currently trying
to set up a second cluster on version 0.90.0 using ports (http: 9400 /
tcp: 9500); call this one cluster B.

For cluster B I am seeing sporadic disconnects between the nodes, with
elasticsearch going into odd, not-quite-split-brain scenarios where the
view of the cluster from one VM isn't quite the same as the view from
another. The settings are almost identical between clusters A and B:
minimum master nodes = 3 and multicast disabled; other than that everything
is mostly vanilla. The only difference in settings is that cluster B uses
shard allocation awareness on the datacenter attribute, but I don't suspect
this is the culprit (though I could be wrong). All the problems I am
describing with cluster B occur in both the qa and production environments,
although I haven't been able to manually force any of the issues to occur.
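For reference, here is a minimal elasticsearch.yml sketch of the cluster B settings described above (hostnames, the unicast host list, and the attribute value are my illustrative guesses, not taken from the actual config):

```yaml
# Sketch of a cluster B node's elasticsearch.yml (values illustrative)
cluster.name: newspr_production
node.datacenter: nj                            # per-node attribute used for awareness
http.port: 9400
transport.tcp.port: 9500

discovery.zen.minimum_master_nodes: 3          # quorum of the 4 master-eligible nodes
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["ny01", "ny02", "nj01", "nj02"]

cluster.routing.allocation.awareness.attributes: datacenter
```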

Some theories I've had are:

  1. It happens under heavy load
     - False: I've seen it happen both during periods of inactivity and
       under constant stress
  2. Minimum number of master nodes is not being set correctly
     - I've tried setting minimum number of master nodes both through the
       admin API and directly in elasticsearch.yml
  3. Network congestion
     - Not completely debunked; however, cluster A has no problems
Here is the view of the cluster from nj01
{
  "ok": true,
  "cluster_name": "newspr_production",
  "nodes": {
    "EQuwUxYRSj6UAXD_4lKghg": {
      "name": "nj01",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "nj01",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "nj" }
    },
    "Uv1PDUPgS4SVUbl1myHEjw": {
      "name": "nj02",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "nj02",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "nj" }
    },
    "NSwz2geDRR6F7WhszPA0qQ": {
      "name": "ny02",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "ny02",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "ny" }
    }
  }
}

Here is the view of the cluster from ny01
{
  "ok": true,
  "cluster_name": "newspr_production",
  "nodes": {
    "Uv1PDUPgS4SVUbl1myHEjw": {
      "name": "nj02",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "nj02",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "nj" }
    },
    "-XYoaJ2MSA6HaMk_CgmBFQ": {
      "name": "ny01",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "ny01",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "ny" }
    },
    "NSwz2geDRR6F7WhszPA0qQ": {
      "name": "ny02",
      "transport_address": "inet[/xxx:9500]",
      "hostname": "ny02",
      "version": "0.90.0",
      "http_address": "inet[/xxx:9400]",
      "attributes": { "datacenter": "ny" }
    }
  }
}

Here are the logs from ny01:
[2013-07-25 00:29:11,424][INFO ][cluster.service ] [ny01] removed {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-node_failed([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:14,060][WARN ][discovery.zen ] [ny01] not enough master nodes, current nodes: {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}
[2013-07-25 00:29:14,060][INFO ][cluster.service ] [ny01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:17,118][INFO ][cluster.service ] [ny01] detected_master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, added {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-25 08:52:32,535][INFO ][discovery.zen ] [ny01] master_left [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}], reason [transport disconnected (with verified connect)]
[2013-07-25 08:52:32,537][INFO ][cluster.service ] [ny01] master {new [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}, previous [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}}, removed {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-master_failed ([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj})
[2013-07-25 15:04:35,187][INFO ][cluster.service ] [ny01] added {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-receive(join from node[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][3], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,246][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,253][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,413][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,779][WARN ][discovery.zen ] [ny01] received a join request for an existing node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}]

Here are logs from nj01:
[2013-07-25 00:29:12,043][INFO ][discovery.zen ] [nj01] master_left [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [do not exists on master, act as master failure]
[2013-07-25 00:29:12,045][INFO ][cluster.service ] [nj01] master {new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, previous [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny})
[2013-07-25 00:29:17,111][INFO ][cluster.service ] [nj01] added {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(join from node[[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}])
[2013-07-25 04:41:16,491][WARN ][discovery.zen ] [nj01] received a join request for an existing node [[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}]
[2013-07-25 04:41:16,545][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][0], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,549][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,550][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,546][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,603][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,674][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 13:03:52,669][WARN ][transport ] [nj01] Received response for a request that has timed out, sent [44493ms] ago, timed out [14482ms] ago, action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [245699]
[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,137][WARN ][discovery.zen ] [nj01] not enough master nodes, current nodes: {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},}
[2013-07-25 15:04:32,138][INFO ][cluster.service ] [nj01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,139][WARN ][transport ] [nj01] Received response for a request that has timed out, sent [44355ms] ago, timed out [14344ms] ago, action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [274173]
[2013-07-25 15:04:35,190][INFO ][cluster.service ] [nj01] detected_master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}, added {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}])
[2013-07-25 19:16:35,360][INFO ][discovery.zen ] [nj01] master_left [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [transport disconnected (with verified connect)]
[2013-07-25 19:16:35,360][INFO ][cluster.service ] [nj01] master {new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, previous [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny})

Thanks in advance to anyone willing to dig into this; please let me know
if you would like more detailed information.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Other than confirming the timeout values are the same (and that the
defaults haven't changed between 0.18.7 and 0.90), I'd recommend moving up
to the latest release.
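One way to confirm the timeouts match is to diff the discovery/transport settings reported by each node. A hypothetical sketch (the sample responses below are made up for illustration; in practice you would pull the flattened settings from each node's nodes-info output):

```python
# Diff discovery/transport settings between two nodes' flattened settings maps.

def diff_settings(a, b, prefixes=("discovery.", "transport.")):
    """Return {key: (value_a, value_b)} for keys under the given prefixes
    that differ between the two settings maps (missing keys show as None)."""
    keys = {k for k in set(a) | set(b) if k.startswith(prefixes)}
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}

# Illustrative sample data, not taken from the cluster above:
nj01 = {"discovery.zen.minimum_master_nodes": "3",
        "discovery.zen.fd.ping_timeout": "30s"}
ny01 = {"discovery.zen.minimum_master_nodes": "3",
        "discovery.zen.fd.ping_timeout": "10s"}

print(diff_settings(nj01, ny01))  # any mismatched timeout shows up here
```

Any node whose effective settings drifted from the others (e.g. a yml edit that never got restarted in) would show up in the diff.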

Looks like you are running your cluster across DCs? I would personally
avoid doing that due to connectivity issues that are nearly inevitable, but
I'm probably just paranoid after having data get nuked during split brains
in pre-0.13 releases :slight_smile: It could be a more common setup than I
think, though, and I think there are enough dials to be able to do this
safely.
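One such dial, if the cluster does stay cross-DC, might be forced awareness on the same attribute, so that when one datacenter drops out the survivors don't try to allocate every copy locally. A sketch (this is my suggestion, not something from the original post):

```yaml
# elasticsearch.yml sketch: forced awareness across the two datacenters
cluster.routing.allocation.awareness.attributes: datacenter
cluster.routing.allocation.awareness.force.datacenter.values: ny,nj
```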

Also, this issue might be related:
https://github.com/elasticsearch/elasticsearch/issues/2117

Best Regards,
Paul

On Friday, July 26, 2013 9:35:20 AM UTC-6, Keith L wrote:



Thanks for the reply Paul, that issue definitely looks relevant to this,
although I'm curious why some of these disconnects are occurring even
before the master election fails on the second attempt.

For example on this line:
[2013-07-25 15:04:32,138][INFO ][cluster.service ] [nj01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)

There is no indication of any pings failing around that time. I do not
understand which underlying mechanism decides whether or not these nodes
should be removed.
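For reference, as far as I can tell two mechanisms can trigger a removal: the zen fault-detection pings, and the transport layer reporting a closed connection directly, which is what "transport disconnected (with verified connect)" in these logs suggests, with no ping failure needed. The fault-detection side is tunable; a sketch with what I believe are the 0.90 defaults (worth double-checking against the docs):

```yaml
# elasticsearch.yml sketch: zen fault-detection knobs (values are my
# recollection of the 0.90 defaults, not from the cluster above)
discovery.zen.fd.ping_interval: 1s    # how often nodes/master ping each other
discovery.zen.fd.ping_timeout: 30s    # how long to wait for each ping
discovery.zen.fd.ping_retries: 3      # failures tolerated before removal
```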

On Saturday, July 27, 2013 4:11:10 AM UTC-4, ppearcy wrote:

Also, this issue might be related:
https://github.com/elasticsearch/elasticsearch/issues/2117

for [press_release][1<span style="color: #660;" clas
...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thanks for the reply, Paul. That issue definitely looks relevant, although I'm
curious why some of these disconnects occur even before the master election
fails on the second attempt.

For example on this line:
[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)

There is no indication of any pings to ny01 failing around that time, and I
don't understand which underlying mechanism decides whether these nodes
should be removed.
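If it helps to reason about it, here is a rough sketch (plain Python, not the actual Elasticsearch code, and the function names are mine) of how I understand the two removal paths in 0.90's zen fault detection: the periodic ping path retries a few times before failing a node, but a transport-level disconnect is verified with a single reconnect attempt and then reported immediately, which would explain a removal with no failed pings in the log.

```python
# Sketch of zen fault detection's two triggers (hypothetical names; the real
# logic lives in discovery.zen.fd). Path 1 only fails a node after repeated
# ping timeouts; path 2 fails it at once when the TCP connection drops.

PING_RETRIES = 3  # discovery.zen.fd.ping_retries default

def on_ping_timeout(node, failed_pings):
    # Path 1: count consecutive ping timeouts; fail only after the retries
    # are exhausted. Until then, keep pinging and return nothing.
    failed_pings[node] = failed_pings.get(node, 0) + 1
    if failed_pings[node] >= PING_RETRIES:
        return f"zen-disco-node_failed({node}), reason failed to ping"
    return None

def on_transport_disconnect(node, reconnect_ok):
    # Path 2: a dropped connection short-circuits the retry loop.
    # "with verified connect" means the verification reconnect SUCCEEDED,
    # yet the node is still reported failed for the original disconnect.
    verified = "with verified connect" if reconnect_ok else "failed to connect"
    return f"zen-disco-node_failed({node}), reason transport disconnected ({verified})"

print(on_transport_disconnect("ny01", reconnect_ok=True))
```

That would match the reason string above: ny01 was removed via the disconnect listener, so no ping failures ever show up around that timestamp.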

On Friday, July 26, 2013 11:35:20 AM UTC-4, Keith L wrote:

...

I've restarted the cluster and turned up the discovery and cluster logging
and noticed some interesting behavior.

Take this sequence of events:

1: Initial cluster: ny01, ny02 (master), nj01, nj02

2: nj02 detects master ny02 failed

[2013-07-30 04:59:17,212][TRACE][discovery.zen.fd ] [nj02] [master] [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}] transport disconnected (with verified connect)
[2013-07-30 04:59:17,215][DEBUG][discovery.zen.fd ] [nj02] [master] stopping fault detection against master [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}], reason [master failure, transport disconnected (with verified connect)]
[2013-07-30 04:59:17,216][INFO ][discovery.zen ] [nj02] master_left [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}], reason [transport disconnected (with verified connect)]
[2013-07-30 04:59:17,217][DEBUG][cluster.service ] [nj02] processing [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]: execute
[2013-07-30 04:59:17,217][TRACE][cluster.service ] [nj02] cluster state updated:
version [43], source [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]
nodes:
[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}
[nj01][OHFtlwC0QqueiE7ybmz8CA][inet[/xxx:9500]]{datacenter=nj}
[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, local, master

3: nj02 elects itself as master

[2013-07-30 04:59:17,217][INFO ][cluster.service ] [nj02] master {new [nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, previous [ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})
[2013-07-30 04:59:17,218][DEBUG][discovery.zen.publish ] [nj02] failed to send cluster state to [[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}], should be detected as failed soon...
org.elasticsearch.transport.SendRequestTransportException: [ny01][inet[/xxx:9500]][discovery/zen/publish]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
    at org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:97)
    at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:276)
    at org.elasticsearch.discovery.DiscoveryService.publish(DiscoveryService.java:115)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:311)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:95)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [ny01][inet[/xxx:9500]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:788)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:522)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:184)
    ... 8 more
[2013-07-30 04:59:17,218][DEBUG][cluster.service ] [nj02] processing [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]: done applying updated cluster_state
[2013-07-30 04:59:17,219][DEBUG][cluster.service ] [nj02] processing [routing-table-updater]: execute
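The "failed to send cluster state" line looks significant to me: publishing the new cluster state appears to be best-effort, so a send failure is just logged on the assumption that fault detection will clean up the unreachable node. A minimal sketch (hypothetical names, modeled on the DEBUG line above, not the actual ES publish code):

```python
# Best-effort cluster state publish: a NodeNotConnectedException is caught,
# logged, and otherwise ignored -- no retry, no rollback of the election.

class NodeNotConnectedError(Exception):
    pass

def send_cluster_state(node, connected, log):
    # Sending requires an open transport channel to the target node.
    if node not in connected:
        raise NodeNotConnectedError(f"[{node}] Node not connected")
    log.append(f"published to {node}")

def publish(nodes, connected, log):
    for node in nodes:
        try:
            send_cluster_state(node, connected, log)
        except NodeNotConnectedError:
            # Matches "failed to send cluster state to [ny01], should be
            # detected as failed soon..." -- the publish simply moves on.
            log.append(f"failed to send cluster state to {node}")

log = []
publish(["ny01", "nj01"], connected={"nj01"}, log=log)
print(log)
```

If that reading is right, ny01 keeps its old view of the cluster after nj02's election, which is exactly what the next step shows.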

4: nj02 sends nj01 the new cluster state, telling it that nj02 is now the master

[2013-07-30 04:59:17,220][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: execute
[2013-07-30 04:59:17,220][DEBUG][discovery.zen.fd ] [nj01] [master] restarting fault detection against master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}], reason [new cluster stare received and we monitor the wrong master [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}]]
[2013-07-30 04:59:17,220][TRACE][cluster.service ] [nj01] cluster state updated:
...
[2013-07-30 04:59:17,220][INFO ][cluster.service ] [nj01] master {new [nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, previous [ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-30 04:59:17,221][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: done applying updated cluster_state
[2013-07-30 04:59:17,221][DEBUG][transport.netty ] [nj01] disconnected from [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}]
[2013-07-30 04:59:17,226][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: execute
[2013-07-30 04:59:17,227][TRACE][cluster.service ] [nj01] cluster state updated:
version [44], source [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]
nodes:
[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}
[nj01][OHFtlwC0QqueiE7ybmz8CA][inet[/xxx:9500]]{datacenter=nj}, local
[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, master

Meanwhile, ny01 and ny02 have no idea what is going on on the other two
nodes, and we end up with the following two split-brain views of the cluster:

Cluster1:
ny01, ny02(master), nj01, nj02

Cluster2:
ny01, nj01, nj02(master)
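To make the quorum arithmetic concrete, here is a minimal illustration (the node sets are taken from the two views above; the check itself is my simplification of what zen discovery does): each side evaluates minimum_master_nodes against its own local view, and because ny01 is counted in both views, both sides pass the check at the same time.

```python
# Why minimum_master_nodes=3 did not prevent this split: the quorum is
# checked against each node's LOCAL view, and the views overlap on ny01.

MIN_MASTER_NODES = 3  # discovery.zen.minimum_master_nodes

def can_be_master(local_view):
    # Simplified quorum check: enough master-eligible nodes visible locally?
    return len(local_view) >= MIN_MASTER_NODES

view_ny02 = {"ny01", "ny02", "nj01", "nj02"}  # ny02 never saw anyone leave
view_nj02 = {"ny01", "nj01", "nj02"}          # nj02 dropped ny02 and took over

assert can_be_master(view_ny02)   # 4 >= 3: ny02 keeps mastership
assert can_be_master(view_nj02)   # 3 >= 3: nj02 also elects itself
print("overlap:", view_ny02 & view_nj02)
```

So the setting only helps if a disconnected node actually drops out of both views; here ny01 stayed reachable from both sides, so both "clusters" believe they have quorum.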

On Friday, July 26, 2013 11:35:20 AM UTC-4, Keith L wrote:

...