Discovery_zen disconnect issues

Hi All, I am seeing some weirdness in cluster connectivity across
multiple environments. I have two environments set up (qa and production),
and both are set up identically. Each environment consists of 4 VMs
(ny01, ny02, nj01, nj02) with two instances of elasticsearch running on
each (two separate clusters). One cluster is running elasticsearch 0.18.7
on the standard ports (http: 9200 / tcp: 9300). This cluster has been set up
for about a year and is not experiencing any problems; we will call
this one cluster A. I am currently trying to set up a second cluster using
version 0.90.0 on ports (http: 9400 / tcp: 9500); call this one cluster
B.

For cluster B I am seeing sporadic disconnects between the nodes, with
elasticsearch going into odd, not-quite-split-brain scenarios where the
view of the cluster from one VM isn't quite the same as the view from
another VM. The settings are almost identical between cluster A and B:
minimum master nodes = 3 and multicast disabled; other than that everything
is mostly vanilla. The only difference in settings is that cluster B uses
shard allocation awareness on the datacenter attribute. I don't suspect
this to be the culprit, but I could be wrong. All the problems I
am describing with cluster B are seen in both the qa and production
environments, although I haven't been able to manually force any of the
issues to occur.

Some theories I've had are:

  1. It happens under heavy load
    - False, I've seen it happen during periods of inactivity and during
    constant stress
  2. Minimum number of master nodes is not being set correctly
    - I've tried setting minimum number of master nodes through the admin API
    and in elasticsearch.yml directly
  3. Network congestion
    - Not completely debunked, however cluster A has no problems
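
As a sanity check on theory 2, the usual rule of thumb for minimum master nodes is a strict majority of the master-eligible nodes. The sketch below only does that arithmetic. It assumes all four nodes in cluster B are master-eligible (the post does not say otherwise), and the function name `quorum` is made up for illustration.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Return the smallest strict majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Cluster B: ny01, ny02, nj01, nj02 (assumed to all be master-eligible).
print(quorum(4))
```

For four nodes this gives 3, which matches the configured value. It also means that with two nodes per datacenter, neither side of a ny/nj partition can reach quorum on its own, which is consistent with the "not enough master nodes" warnings in the logs listing only two current nodes.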

Here is the view of the cluster from nj01
{
ok: true
cluster_name: newspr_production
nodes: {
EQuwUxYRSj6UAXD_4lKghg: {
name: nj01
transport_address: inet[/xxx:9500]
hostname: nj01
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
Uv1PDUPgS4SVUbl1myHEjw: {
name: nj02
transport_address: inet[/xxx:9500]
hostname: nj02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
NSwz2geDRR6F7WhszPA0qQ: {
name: ny02
transport_address: inet[/xxx:9500]
hostname: ny02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
}
}

Here is the view of the cluster from ny01
{
ok: true
cluster_name: newspr_production
nodes: {
Uv1PDUPgS4SVUbl1myHEjw: {
name: nj02
transport_address: inet[/xxx:9500]
hostname: nj02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
-XYoaJ2MSA6HaMk_CgmBFQ: {
name: ny01
transport_address: inet[/xxx:9500]
hostname: ny01
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
NSwz2geDRR6F7WhszPA0qQ: {
name: ny02
transport_address: inet[/xxx:9500]
hostname: ny02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
}
}
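
To make the mismatch between the two views explicit, here is a minimal sketch that compares them as sets. The node names are copied from the two outputs above; the helper name `compare_views` is hypothetical.

```python
# Node names as reported by each VM in the two cluster views.
view_from_nj01 = {"nj01", "nj02", "ny02"}
view_from_ny01 = {"nj02", "ny01", "ny02"}


def compare_views(first, second):
    """Split two node sets into nodes seen by only one side and by both."""
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
        "in_both": sorted(first & second),
    }


diff = compare_views(view_from_nj01, view_from_ny01)
print(diff)
```

With the data above, nj01 and ny01 are each missing from the other's view, while nj02 and ny02 appear in both.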

Here are the logs from ny01:
[2013-07-25 00:29:11,424][INFO ][cluster.service ] [ny01] removed {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-node_failed([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:14,060][WARN ][discovery.zen ] [ny01] not enough master nodes, current nodes: {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}
[2013-07-25 00:29:14,060][INFO ][cluster.service ] [ny01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:17,118][INFO ][cluster.service ] [ny01] detected_master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, added {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-25 08:52:32,535][INFO ][discovery.zen ] [ny01] master_left [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}], reason [transport disconnected (with verified connect)]
[2013-07-25 08:52:32,537][INFO ][cluster.service ] [ny01] master {new [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}, previous [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}}, removed {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-master_failed ([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj})
[2013-07-25 15:04:35,187][INFO ][cluster.service ] [ny01] added {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen-disco-receive(join from node[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][3], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,246][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,253][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,413][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,779][WARN ][discovery.zen ] [ny01] received a join request for an existing node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}]

Here are logs from nj01:
[2013-07-25 00:29:12,043][INFO ][discovery.zen ] [nj01] master_left [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [do not exists on master, act as master failure]
[2013-07-25 00:29:12,045][INFO ][cluster.service ] [nj01] master {new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, previous [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny})
[2013-07-25 00:29:17,111][INFO ][cluster.service ] [nj01] added {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(join from node[[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}])
[2013-07-25 04:41:16,491][WARN ][discovery.zen ] [nj01] received a join request for an existing node [[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}]
[2013-07-25 04:41:16,545][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][0], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,549][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,550][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,546][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,603][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,674][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 13:03:52,669][WARN ][transport ] [nj01] Received response for a request that has timed out, sent [44493ms] ago, timed out [14482ms] ago, action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [245699]
[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,137][WARN ][discovery.zen ] [nj01] not enough master nodes, current nodes: {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},}
[2013-07-25 15:04:32,138][INFO ][cluster.service ] [nj01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,139][WARN ][transport ] [nj01] Received response for a request that has timed out, sent [44355ms] ago, timed out [14344ms] ago, action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [274173]
[2013-07-25 15:04:35,190][INFO ][cluster.service ] [nj01] detected_master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}, added {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}])
[2013-07-25 19:16:35,360][INFO ][discovery.zen ] [nj01] master_left [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [transport disconnected (with verified connect)]
[2013-07-25 19:16:35,360][INFO ][cluster.service ] [nj01] master {new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, previous [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny})

Thanks in advance to anyone willing to take this on. Please let me know if
you would like more detailed information.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Other than confirming timeout values are the same (or that defaults haven't
changed between 0.18.7 and 0.90), I'd recommend moving up to the latest
release.

Looks like you are running your cluster across DCs? I would personally
avoid doing that due to the connectivity issues that are nearly inevitable, but
I'm probably just paranoid after having data get nuked during split brains
in pre-0.13 releases :) It could be a more common setup than I think,
though, and I think there are enough dials to be able to do this safely.

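Also, this issue might be related:
split brain condition after second network disconnect - even with minimum_master_nodes set · Issue #2117 · elastic/elasticsearch · GitHub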

Best Regards,
Paul

Thanks for the reply Paul, that issue definitely looks relevant to this,
although I'm curious why some of these disconnects are occurring even
before the master election fails on the second attempt.

For example on this line:
[2013-07-25 15:04:32,138][INFO ][cluster.service ] [nj01] removed {[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)

There is no indication of any pings failing around that time. I do not
understand which underlying mechanism decides whether or not these nodes
should be removed.
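
For reference, the only ping-related entries in the nj01 log are the two "Received response for a request that has timed out" warnings for `discovery/zen/fd/ping`. Below is a minimal sketch that pulls the timings out of those two lines and derives the timeout that must have been in effect (time since send minus time since timeout). The regex and the helper name `ping_timings` are my own; only the log lines come from the thread.

```python
import re

# The two transport warnings from the nj01 log, each joined onto one line.
LOG_LINES = [
    "[2013-07-25 13:03:52,669][WARN ][transport ] [nj01] Received response "
    "for a request that has timed out, sent [44493ms] ago, timed out "
    "[14482ms] ago, action [discovery/zen/fd/ping], node [[nj02]"
    "[Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [245699]",
    "[2013-07-25 15:04:32,139][WARN ][transport ] [nj01] Received response "
    "for a request that has timed out, sent [44355ms] ago, timed out "
    "[14344ms] ago, action [discovery/zen/fd/ping], node [[nj02]"
    "[Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj}], id [274173]",
]

PATTERN = re.compile(
    r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago, action \[([^\]]+)\]"
)


def ping_timings(lines):
    """Extract round-trip time and implied timeout from late-response warnings."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a late-response warning
        sent_ms = int(match.group(1))
        timed_out_ms = int(match.group(2))
        results.append(
            {
                "action": match.group(3),
                "response_after_ms": sent_ms,
                # The request timed out this long after it was sent.
                "implied_timeout_ms": sent_ms - timed_out_ms,
            }
        )
    return results


for entry in ping_timings(LOG_LINES):
    print(entry)
```

Both entries work out to an implied timeout of 30011 ms, with the response arriving roughly 44 seconds after the ping was sent. That looks like a 30 second fault-detection ping timeout, which I believe is the default, though that is worth verifying against the 0.90 documentation. Both late responses came from nj02, which sits in the same datacenter as nj01.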

On Saturday, July 27, 2013 4:11:10 AM UTC-4, ppearcy wrote:

Also, this issue might be related:
split brain condition after second network disconnect - even with minimum_master_nodes set · Issue #2117 · elastic/elasticsearch · GitHub

Best Regards,
Paul

On Friday, July 26, 2013 9:35:20 AM UTC-6, Keith L wrote:

Hi All, I am seeing some weirdness in terms of cluster connectivity
across multiple environments. I have two environments setup (qa and
production), both are identically setup. The configuration of each
environments is 4 VMs (ny01, ny02, nj01, nj02) with two instances of
elasticsearch running on each (two separate clusters). One cluster is
running elasticsearch 0.18.7 on the standard ports (http: 9200/ tcp: 9300),
this cluster has been setup about a year and is not experiencing any sorts
of problems, we will call this one cluster A. I am currently trying to
setup a second cluster using version 0.90.0 using ports (http: 9400/ tcp:
9500), call this one cluster B.

For cluster B I am seeing sporadic disconnects between the nodes with
elasticsearch going into odd but not quite split brain scenarios where the
view of the cluster from one VM isn't quite the same as the view from the
other VM. The settings are almost identical between cluster A and B,
minimum master nodes=3 and multicast disabled, other than that everything
is mostly vanilla. The only difference in settings is that cluster B uses
shard allocation awareness on the datacenter attribute, but i don't suspect
this to be culprit of the issue, but i could be wrong. All the problems I
am explaining with cluster B are seen in both qa and production
environments although I haven't been able to manually force any of the
issues to occur.

Some theories I've had are:

  1. It happens under heavy load
    -False, i've seen it happen during periods of inactivity and during
    constant stress
  2. Minimum number of master nodes is not being set correctly,
  • I've tried setting minimum number of master nodes through admin api
    and elasticsearch.yml directly
  1. Network congestion
  • Not completely debunked, however cluster A has no problems

Here is the view of the cluster from nj01
{
ok: true
cluster_name: newspr_production
nodes: {
EQuwUxYRSj6UAXD_4lKghg: {
name: nj01
transport_address: inet[/xxx:9500]
hostname: nj01
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
Uv1PDUPgS4SVUbl1myHEjw: {
name: nj02
transport_address: inet[/xxx:9500]
hostname: nj02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
NSwz2geDRR6F7WhszPA0qQ: {
name: ny02
transport_address: inet[/xxx:9500]
hostname: ny02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
}
}

Here is the view of the cluster from ny01
{
ok: true
cluster_name: newspr_production
nodes: {
Uv1PDUPgS4SVUbl1myHEjw: {
name: nj02
transport_address: inet[/xxx:9500]
hostname: nj02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: nj
}
}
-XYoaJ2MSA6HaMk_CgmBFQ: {
name: ny01
transport_address: inet[/xxx:9500]
hostname: ny01
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
NSwz2geDRR6F7WhszPA0qQ: {
name: ny02
transport_address: inet[/xxx:9500]
hostname: ny02
version: 0.90.0
http_address: inet[/xxx:9400]
attributes: {
datacenter: ny
}
}
}
}

Here are the logs from ny01:
[2013-07-25 00:29:11,424][INFO ][cluster.service ] [ny01]removed
{[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason
: zen-disco-node_failed([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:14,060][WARN ][discovery.zen ] [ny01] notenough master nodes
, current nodes: {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter
=ny},}
[2013-07-25 00:29:14,060][INFO ][cluster.service ] [ny01]removed
{[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-
disco-node_failed([nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{
datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:17,118][INFO ][cluster.service ] [ny01]detected_master
[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, added {[
nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][
Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-
disco-receive(from master [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500
]]{datacenter=nj}])
[2013-07-25 08:52:32,535][INFO ][discovery.zen ] [ny01]master_left
[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}], reason
[transport disconnected (with verified connect)]
[2013-07-25 08:52:32,537][INFO ][cluster.service ] [ny01]master
{new [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},previous
[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}}, removed
{[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason
: zen-disco-master_failed ([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500
]]{datacenter=nj})
[2013-07-25 15:04:35,187][INFO ][cluster.service ] [ny01] added
{[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason
: zen-disco-receive(join from node[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/
xxx:9500]]{datacenter=nj}])
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][3], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,246][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,253][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny} marked shard as started, but shard have not been created, mark shard as failed]
[2013-07-25 15:04:35,413][WARN ][cluster.action.shard ] [ny01] received shard failed for [press_release][1
...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thanks for the reply Paul, that issue definitely looks relevant to this,
although I'm curious why some of these disconnects are occurring even
before the master election fails on the second attempt.

For example on this line:
[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected (with verified connect)

There is no indication of any pings to ny01 failing around that time. I do
not understand which underlying mechanism decides whether or not these
nodes should be removed.
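
To line up events like the one quoted above against what the other nodes logged in the same few seconds, a small script along these lines can help. This is only a rough sketch: the names `parse_line` and `events_near` are made up for illustration, the regex assumes the `[timestamp][LEVEL][logger] [node] message` layout seen in the logs in this thread, and each entry has to be on a single line (mail-wrapped lines will not match).

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the log lines in this thread:
# [2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] message...
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<logger>[^\]]+?)\s*\]"
    r"\s*\[(?P<node>[^\]]+)\]\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Split one unwrapped log line into (timestamp, level, logger, node, message)."""
    m = LOG_LINE.match(line)
    if m is None:
        return None  # continuation line, stack trace, or wrapped text
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group("level"), m.group("logger"), m.group("node"), m.group("msg")


def events_near(lines, when, window_seconds=30):
    """Return parsed entries logged within window_seconds of the given time."""
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and abs(parsed[0] - when) <= window:
            hits.append(parsed)
    return hits


sample = (
    "[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed "
    "{[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, "
    "reason: zen-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ]"
    "[inet[/xxx:9500]]{datacenter=ny}), reason transport disconnected "
    "(with verified connect)"
)

for ts, level, logger, node, msg in events_near([sample], datetime(2013, 7, 25, 13, 3, 52)):
    print(ts, level, logger, node, msg[:40])
```

Feeding it the concatenated logs from all four nodes and the timestamp of a removal shows whether anything fault-detection related was logged elsewhere around that moment.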


I've restarted the cluster and turned up the discovery and cluster logging
and noticed some interesting behavior.

Take this sequence of events:

1: The cluster consists of ny01, ny02 (master), nj01, nj02

2: nj02 detects master ny02 failed

[2013-07-30 04:59:17,212][TRACE][discovery.zen.fd ] [nj02] [master] [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}] transport disconnected (with verified connect)
[2013-07-30 04:59:17,215][DEBUG][discovery.zen.fd ] [nj02] [master] stopping fault detection against master [[nyclpwsselasc02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}], reason [master failure, transport disconnected (with verified connect)]
[2013-07-30 04:59:17,216][INFO ][discovery.zen ] [nj02] master_left [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}], reason [transport disconnected (with verified connect)]
[2013-07-30 04:59:17,217][DEBUG][cluster.service ] [nj02] processing [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]: execute
[2013-07-30 04:59:17,217][TRACE][cluster.service ] [nj02] cluster state updated:
version [43], source [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]
nodes:
[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}
[nj01][OHFtlwC0QqueiE7ybmz8CA][inet[/xxx:9500]]{datacenter=nj}
[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, local, master

3: nj02 elects itself as master

[2013-07-30 04:59:17,217][INFO ][cluster.service ] [nj02] master {new [nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, previous [ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})
[2013-07-30 04:59:17,218][DEBUG][discovery.zen.publish ] [nj02] failed to send cluster state to [[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}], should be detected as failed soon...
org.elasticsearch.transport.SendRequestTransportException: [ny01][inet[/xxx:9500]][discovery/zen/publish]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
at org.elasticsearch.discovery.zen.publish.PublishClusterStateAction.publish(PublishClusterStateAction.java:97)
at org.elasticsearch.discovery.zen.ZenDiscovery.publish(ZenDiscovery.java:276)
at org.elasticsearch.discovery.DiscoveryService.publish(DiscoveryService.java:115)
at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:311)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:95)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [ny01][inet[/xxx:9500]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:788)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:522)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:184)
... 8 more
[2013-07-30 04:59:17,218][DEBUG][cluster.service ] [nj02] processing [zen-disco-master_failed ([ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny})]: done applying updated cluster_state
[2013-07-30 04:59:17,219][DEBUG][cluster.service ] [nj02] processing [routing-table-updater]: execute

4: nj02 sends nj01 the new cluster state and tells it that nj02 is the new master

[2013-07-30 04:59:17,220][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: execute
[2013-07-30 04:59:17,220][DEBUG][discovery.zen.fd ] [nj01] [master] restarting fault detection against master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}], reason [new cluster stare received and we monitor the wrong master [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}]]
[2013-07-30 04:59:17,220][TRACE][cluster.service ] [nj01] cluster state updated:
...
[2013-07-30 04:59:17,220][INFO ][cluster.service ] [nj01] master {new [nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, previous [ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])
[2013-07-30 04:59:17,221][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: done applying updated cluster_state
[2013-07-30 04:59:17,221][DEBUG][transport.netty ] [nj01] disconnected from [[ny02][9fVeGJ7kSn6vFKTpcWRniQ][inet[/xxx:9500]]{datacenter=ny}]
[2013-07-30 04:59:17,226][DEBUG][cluster.service ] [nj01] processing [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]: execute
[2013-07-30 04:59:17,227][TRACE][cluster.service ] [nj01] cluster state updated:
version [44], source [zen-disco-receive(from master [[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}])]
nodes:
[ny01][yuj0IIYCTKa1ieIRBLz1bQ][inet[/xxx:9500]]{datacenter=ny}
[nj01][OHFtlwC0QqueiE7ybmz8CA][inet[/xxx:9500]]{datacenter=nj}, local
[nj02][1E1OFkNORueWuOQL3cqu2Q][inet[/xxx:9500]]{datacenter=nj}, master

Meanwhile, ny01 and ny02 have no idea what is happening on the other two
nodes, and we end up with the following two split-brain clusters:

Cluster1:
ny01, ny02(master), nj01, nj02

Cluster2:
ny01, nj01, nj02(master)
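
One possible reading of the sequence above, written out as a tiny check. This is a sketch of my guess, not confirmed 0.90.0 behavior: it assumes the minimum master nodes check counts the nodes still listed in the local cluster state rather than the nodes that are actually reachable, which I have not verified in the source. The function name `can_elect_master` is made up for illustration; the node names and the value 3 come from this thread.

```python
# minimum master nodes as configured on cluster B
MINIMUM_MASTER_NODES = 3


def can_elect_master(known_nodes, minimum_master_nodes=MINIMUM_MASTER_NODES):
    """Return True if the given view holds enough master-eligible nodes.

    Hypothetical helper: it only models the counting, nothing else.
    """
    return len(set(known_nodes)) >= minimum_master_nodes


# nj02's view in cluster state version 43: ny01 is still listed, even though
# the publish to it failed with NodeNotConnectedException.
nj02_view = ["ny01", "nj01", "nj02"]

# Nodes nj02 could actually talk to at that moment (ny01 left out).
nj02_reachable = ["nj01", "nj02"]

print("by listed nodes:   ", can_elect_master(nj02_view))
print("by reachable nodes:", can_elect_master(nj02_reachable))
```

If the count is taken over listed nodes, nj02 still sees three nodes and is allowed to elect itself; counted over reachable nodes it would fall below 3 and should have refused.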
