Seeing weirdness with zen_disco_multicast

Hi all, I am seeing some weirdness in cluster connectivity across
multiple environments. I have two environments (qa and production), and
both are set up identically. Each environment consists of 4 VMs
(ny01, ny02, nj01, nj02) with two instances of elasticsearch running on
each (two separate clusters). One cluster runs elasticsearch 0.18.7
on the standard ports (http: 9200 / tcp: 9300). It has been up for
about a year and is not experiencing any problems; we will call
this one cluster A. I am currently trying to set up a second cluster on
version 0.90.0 using ports http: 9400 / tcp: 9500; call this one cluster
B.

For cluster B I am seeing sporadic disconnects between the nodes, with
elasticsearch going into odd, not-quite-split-brain scenarios where the
view of the cluster from one VM isn't the same as the view from
another VM. The settings are almost identical between clusters A and B:
minimum master nodes = 3 and multicast disabled; other than that everything
is mostly vanilla. The only difference in settings is that cluster B uses
shard allocation awareness on the datacenter attribute. I don't suspect
this to be the culprit, but I could be wrong. All the problems I
am describing with cluster B are seen in both the qa and production
environments, although I haven't been able to manually force any of the
issues to occur.
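
As a sanity check on the value of 3, here is a minimal Python sketch of the usual majority rule for minimum master nodes, (master-eligible nodes / 2) + 1. It assumes all four nodes in cluster B are master-eligible, which I have not stated above; `quorum` is just an illustrative helper name.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Return the smallest strict majority of master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Assumption: ny01, ny02, nj01 and nj02 are all master-eligible.
nodes = ["ny01", "ny02", "nj01", "nj02"]
required = quorum(len(nodes))
print(required)  # 3
```

Under that assumption, two nodes per datacenter means neither datacenter can reach 3 on its own, so a node needs to see at least one node from the other datacenter to take part in electing a master.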

Some theories I've had are:

  1. It happens under heavy load
    - False, I've seen it happen during periods of inactivity and during
    constant stress
  2. Minimum number of master nodes is not being set correctly
    - I've tried setting minimum number of master nodes through the admin API
    and in elasticsearch.yml directly
  3. Network congestion
    - Not completely debunked, however cluster A has no problems
Here is the view of the cluster from nj01:
{
  ok: true
  cluster_name: newspr_production
  nodes: {
    EQuwUxYRSj6UAXD_4lKghg: {
      name: nj01
      transport_address: inet[/xxx:9500]
      hostname: nj01
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: nj
      }
    }
    Uv1PDUPgS4SVUbl1myHEjw: {
      name: nj02
      transport_address: inet[/xxx:9500]
      hostname: nj02
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: nj
      }
    }
    NSwz2geDRR6F7WhszPA0qQ: {
      name: ny02
      transport_address: inet[/xxx:9500]
      hostname: ny02
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: ny
      }
    }
  }
}

Here is the view of the cluster from ny01:
{
  ok: true
  cluster_name: newspr_production
  nodes: {
    Uv1PDUPgS4SVUbl1myHEjw: {
      name: nj02
      transport_address: inet[/xxx:9500]
      hostname: nj02
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: nj
      }
    }
    -XYoaJ2MSA6HaMk_CgmBFQ: {
      name: ny01
      transport_address: inet[/xxx:9500]
      hostname: ny01
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: ny
      }
    }
    NSwz2geDRR6F7WhszPA0qQ: {
      name: ny02
      transport_address: inet[/xxx:9500]
      hostname: ny02
      version: 0.90.0
      http_address: inet[/xxx:9400]
      attributes: {
        datacenter: ny
      }
    }
  }
}
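
To make the mismatch between the two views explicit, here is a minimal Python sketch that lists, for each reporting node, the nodes it is missing. The node names are copied from the two views above; `view_differences` is a hypothetical helper, not anything from elasticsearch.

```python
# Node names each VM reports, copied from the two views above.
views = {
    "nj01": {"nj01", "nj02", "ny02"},
    "ny01": {"nj02", "ny01", "ny02"},
}


def view_differences(views):
    """For each reporting node, list the nodes seen elsewhere but missing from its own view."""
    all_nodes = set().union(*views.values())
    return {reporter: sorted(all_nodes - seen) for reporter, seen in views.items()}


diff = view_differences(views)
print(diff)  # {'nj01': ['ny01'], 'ny01': ['nj01']}
```

So each of the two nodes is missing exactly the other one, while both still see nj02 and ny02.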

Here are logs from ny01:

[2013-07-25 00:29:11,424][INFO ][cluster.service ] [ny01] removed
{[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen
-disco-node_failed([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:14,060][WARN ][discovery.zen ] [ny01] not enough master nodes,
current nodes: {[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter
=ny},[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}
[2013-07-25 00:29:14,060][INFO ][cluster.service ] [ny01] removed
{[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco
-node_failed([nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj
}), reason transport disconnected (with verified connect)
[2013-07-25 00:29:17,118][INFO ][cluster.service ] [ny01] detected_master
[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}, added {[nj01
][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},[nj02][
Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco
-receive(from master [[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj}])
[2013-07-25 08:52:32,535][INFO ][discovery.zen ] [ny01] master_left
[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}], reason [transport
disconnected (with verified connect)]
[2013-07-25 08:52:32,537][INFO ][cluster.service ] [ny01] master {
new [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},previous
[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj}}, removed {[
nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen
-disco-master_failed ([nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj})
[2013-07-25 15:04:35,187][INFO ][cluster.service ] [ny01] added {[
nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},}, reason: zen
-disco-receive(join from node[[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500
]]{datacenter=nj}])
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,245][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][3], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,246][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,253][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][0], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,303][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,352][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][2], node[Uv1PDUPgS4SVUbl1myHEjw], [P], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,413][WARN ][cluster.action.shard ] [ny01] received
shard failed for [press_release][1], node[Uv1PDUPgS4SVUbl1myHEjw], [R], s[
STARTED], reason [master [ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 15:04:35,779][WARN ][discovery.zen ] [ny01] received
a join request for an existing node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/
xxx:9500]]{datacenter=nj}]

And the logs from nj01:
[2013-07-25 00:29:12,043][INFO ][discovery.zen ] [nj01] master_left
[[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [do
not exists on master, act as master failure]
[2013-07-25 00:29:12,045][INFO ][cluster.service ] [nj01] master {
new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},previous
[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[
ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen
-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny})
[2013-07-25 00:29:17,111][INFO ][cluster.service ] [nj01] added {[
ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen
-disco-receive(join from node[[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500
]]{datacenter=ny}])
[2013-07-25 04:41:16,491][WARN ][discovery.zen ] [nj01] received
a join request for an existing node [[ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/
xxx:9500]]{datacenter=ny}]
[2013-07-25 04:41:16,545][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][0], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,549][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,550][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,546][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,603][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,607][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][3], node[NSwz2geDRR6F7WhszPA0qQ], [P], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,674][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][2], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 04:41:16,676][WARN ][cluster.action.shard ] [nj01] received
shard failed for [press_release][1], node[NSwz2geDRR6F7WhszPA0qQ], [R], s[
STARTED], reason [master [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{
datacenter=nj} marked shard as started, but shard have not been created, mark shard
as failed]
[2013-07-25 13:03:52,669][WARN ][transport ] [nj01] Received response
for a request that has timed out, sent [44493ms] ago, timed out [14482ms] ago,
action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet
[/xxx:9500]]{datacenter=nj}], id [245699]
[2013-07-25 13:03:52,672][INFO ][cluster.service ] [nj01] removed
{[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen
-disco-node_failed([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,137][WARN ][discovery.zen ] [nj01] not enough master nodes,
current nodes: {[nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter
=nj},[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},}
[2013-07-25 15:04:32,138][INFO ][cluster.service ] [nj01] removed
{[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco
-node_failed([ny02][NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny
}), reason transport disconnected (with verified connect)
[2013-07-25 15:04:32,139][WARN ][transport ] [nj01] Received response
for a request that has timed out, sent [44355ms] ago, timed out [14344ms] ago,
action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw][inet
[/xxx:9500]]{datacenter=nj}], id [274173]
[2013-07-25 15:04:35,190][INFO ][cluster.service ] [nj01] detected_master
[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}, added {[ny01
][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},[nj02][
Uv1PDUPgS4SVUbl1myHEjw][inet[/xxx:9500]]{datacenter=nj},[ny02][
NSwz2geDRR6F7WhszPA0qQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen-disco
-receive(from master [[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny}])
[2013-07-25 19:16:35,360][INFO ][discovery.zen ] [nj01] master_left
[[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}], reason [transport
disconnected (with verified connect)]
[2013-07-25 19:16:35,360][INFO ][cluster.service ] [nj01] master {
new [nj01][EQuwUxYRSj6UAXD_4lKghg][inet[/xxx:9500]]{datacenter=nj},previous
[ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny}}, removed {[
ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{datacenter=ny},}, reason: zen
-disco-master_failed ([ny01][-XYoaJ2MSA6HaMk_CgmBFQ][inet[/xxx:9500]]{
datacenter=ny})
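
The two timed-out fault-detection pings in the nj01 log can be put into numbers with a small script. This is a minimal Python sketch, not part of elasticsearch: the two warnings are copied from the log above and joined onto single lines, and `ping_delays` is a hypothetical helper name.

```python
import re

# The two "timed out" transport warnings from the nj01 log, joined onto single lines.
LOG_LINES = [
    "[2013-07-25 13:03:52,669][WARN ][transport ] [nj01] Received response "
    "for a request that has timed out, sent [44493ms] ago, timed out [14482ms] ago, "
    "action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw]"
    "[inet[/xxx:9500]]{datacenter=nj}], id [245699]",
    "[2013-07-25 15:04:32,139][WARN ][transport ] [nj01] Received response "
    "for a request that has timed out, sent [44355ms] ago, timed out [14344ms] ago, "
    "action [discovery/zen/fd/ping], node [[nj02][Uv1PDUPgS4SVUbl1myHEjw]"
    "[inet[/xxx:9500]]{datacenter=nj}], id [274173]",
]

# \s* tolerates the missing space before "ago" seen in wrapped copies of the log.
PATTERN = re.compile(r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\]\s*ago")


def ping_delays(lines):
    """Return (response_delay_ms, waited_before_timeout_ms) for each matching line."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            sent_ago = int(match.group(1))
            timed_out_ago = int(match.group(2))
            # The request was sent sent_ago ms ago and gave up timed_out_ago ms ago,
            # so it waited sent_ago - timed_out_ago ms before timing out.
            results.append((sent_ago, sent_ago - timed_out_ago))
    return results


delays = ping_delays(LOG_LINES)
print(delays)  # [(44493, 30011), (44355, 30011)]
```

In both cases nj01 waited about 30 seconds before giving up on the ping, and the response from nj02 arrived roughly 44 seconds after the ping was sent.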

Thanks in advance to anyone willing to take a look! Please let me know if more
detailed information is required.
