Network failure resiliency


(Grant) #1

Hi folks,
So we're currently running an 8-node ES cluster with a well-known
hosting provider. About three times now we've had issues that I'm
attributing to network failure for a number of reasons, but I was
wondering if there's some consensus on what ES's resiliency to these
types of failures should be.

Essentially, what happens is that all or a majority of the nodes in
the cluster experience network failures, generally lasting several
minutes. During this time, connectivity is either completely lost or
intermittent.

When the network stabilizes again, what I find is that the nodes in
the cluster are in a very confused state. Some report red, some
yellow, and they all seem stuck in a constant state of trying to
recover; to actually get them to recover I have to completely restart
the cluster.

Obviously we're working with our provider to determine why we're
having these network outages, but I was wondering if any testing has
been done to replicate this kind of failure scenario, and if there's a
reasonable expectation that the cluster should recover on its own in
such cases.

Thanks!


(Ævar Arnfjörð Bjarmason) #2

You might want to try switching from multicast to unicast just to
eliminate a variable.

Some networks don't treat multicast traffic very well.

It's also useful to look at the logs for the ES nodes during these
outages. What do they say?


(Grant) #3

We're using unicast now (Rackspace doesn't allow multicast traffic).

Here's a sample of what's in the logs during the issues. This kind of
thing was streaming pretty much continuously:

[2012-01-16 02:52:41,711][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,711][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-527859-0][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]
[2012-01-16 02:52:41,880][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-194054-1322678627][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,880][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-194054-1322678627][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]
[2012-01-16 02:52:41,894][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,894][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-527859-0][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]
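
For context, a unicast discovery setup of this era is configured in
elasticsearch.yml roughly as follows; the host list shown here is
hypothetical, not taken from the thread:

```yaml
# elasticsearch.yml: Zen discovery over unicast, for networks
# (like Rackspace's) that do not carry multicast traffic.
discovery.zen.ping.multicast.enabled: false
# Hypothetical node addresses; list some or all cluster nodes.
discovery.zen.ping.unicast.hosts: ["10.180.46.201", "10.180.46.202", "10.180.46.203"]
```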


(Shay Banon) #4

I suggest you set discovery.zen.minimum_master_nodes to a higher value, in
your case something like 2 or 3. Then, if a node loses its connection to
the other nodes, it will not "form its own cluster", but will try to rejoin
and form a cluster with that specified minimum.
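
For reference, this setting lives in elasticsearch.yml. A minimal sketch,
using the common quorum rule of thumb for master-eligible nodes (this
formula is general guidance, not from the thread; with all 8 nodes
master-eligible it would give 5):

```yaml
# elasticsearch.yml (Zen discovery, 0.18.x era)
# Rule of thumb: minimum_master_nodes = (master_eligible_nodes / 2) + 1,
# so a partitioned minority cannot elect its own master.
discovery.zen.minimum_master_nodes: 3
```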


(Grant) #5

Hi Shay!

Believe it or not, we already run with minimum master nodes set to
3...


(Shay Banon) #6

Then next time it happens, can you dropbox the logs of the nodes?


(Grant) #7

I still have logs if you'd be interested in having a look. Let me grab
them...


(Grant) #8

As an aside, after talking with our provider: while all our nodes are
on different physical hosts, six of the eight are in the same huddle,
so they share a switch. My suspicion is the switch was either rebooted
or was flapping.


(Grant) #9

Shay: on dropbox, sent you an invite.

Thanks for any help,
-G


(Shay Banon) #10

Can you place it in Dropbox under Public and just send me a public link to
download it? I'm having problems with shared folders.


(Grant) #11

Done. Replied privately.


(MagmaRules) #12

Hi there,

I'm having a similar problem. Cluster node 3, "liegep03", failed with a
timeout for some reason. After a while it started retrying to connect and
got stuck in this state: http://pastie.org/3761184 . It left me with a
1.2 GB log file.

The server was restarted and then it entered a new state (in the pastie
under "Restart"). I ended up shutting down the server.

While the server was up, the cluster remained in a yellow state. After I
shut it down, the cluster recovered to a green state.

Did you figure out the reason for this behaviour? I'm using 0.18.4. Has
0.19.2 improved anything in this area?

On Friday, January 20, 2012 6:03:22 PM UTC, Grant wrote:

Done. Replied privately.

On Jan 18, 3:51 pm, Shay Banon kim...@gmail.com wrote:

Can you place it in dropbox under Public, and just send me a public link
to
download it? Have problems with sharing folders.

On Wed, Jan 18, 2012 at 7:51 PM, Grant gr...@brewster.com wrote:

Shay: on dropbox, sent you an invite.

Thanks for any help,
-G

On Jan 17, 2:21 pm, Grant gr...@brewster.com wrote:

As an aside, after talking with our provider, while all our nodes are on different physicals, six of the 8 exist in the same huddle, so they share a switch. My suspicion is the switch was either rebooted or was flapping.

On Jan 17, 12:58 pm, Grant gr...@brewster.com wrote:

I still have logs if you'd be interested in having a look. Let me grab them...

On Jan 17, 12:28 pm, Shay Banon kim...@gmail.com wrote:

Then next time it happens, can you dropbox the logs of the nodes?

On Tue, Jan 17, 2012 at 2:29 PM, Grant gr...@brewster.com wrote:

Hi Shay!

Believe it or not, we already run with minimum master nodes set to 3...

On Jan 17, 4:51 am, Shay Banon kim...@gmail.com wrote:

I suggest you set discovery.zen.minimum_master_nodes to a higher value, in your case, something like 2 or 3. Then, if a node loses connection to other nodes, it will not "form its own cluster", but will try and rejoin and form a cluster with that minimum specified.

On Mon, Jan 16, 2012 at 10:16 PM, Grant gr...@brewster.com wrote:

We're using unicast now (Rackspace doesn't allow multicast traffic).

Here's a sample of what's in the logs during the issues. This kind of thing was streaming pretty much continuously:

[2012-01-16 02:52:41,711][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,711][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-527859-0][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]
[2012-01-16 02:52:41,880][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-194054-1322678627][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,880][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-194054-1322678627][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]
[2012-01-16 02:52:41,894][WARN ][indices.cluster ] [prod-es-r03] [contact_documents-527859-0][0] master [[prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]]] marked shard as started, but shard have not been created, mark shard as failed
[2012-01-16 02:52:41,894][WARN ][cluster.action.shard ] [prod-es-r03] sending failed shard for [contact_documents-527859-0][0], node[zB6rqHbHQrm727WdL5iXrw], [R], s[STARTED], reason [master [prod-es-r06][IfNWkYASSg-TOZuMI7nj5w][inet[/10.180.46.203:9300]] marked shard as started, but shard have not been created, mark shard as failed]

On Jan 16, 2:57 pm, Ævar Arnfjörð Bjarmason ava...@gmail.com wrote:

You might want to try switching from multicast to unicast just to eliminate a variable.

Some networks don't treat multicast traffic very well.

It's also useful to look at the logs for the ES nodes during these outages. What do they say?
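(Editor's note: the setting discussed above is conventionally set to a quorum of master-eligible nodes, (n / 2) + 1, which for Grant's 8-node cluster would be 5 rather than the 3 he was running with. A minimal elasticsearch.yml sketch of the 0.19-era zen-discovery settings; the host addresses are illustrative:)

```yaml
# elasticsearch.yml -- sketch for an 8-node cluster using unicast discovery
# quorum = (8 / 2) + 1 = 5 master-eligible nodes must be visible
# before a master is elected, which prevents split-brain "mini clusters"
discovery.zen.minimum_master_nodes: 5

# unicast instead of multicast (e.g. on Rackspace, which blocks multicast);
# these addresses are placeholders for your own node IPs
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.180.46.201", "10.180.46.202", "10.180.46.203"]
```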


(Shay Banon) #13

Yes, there have been improvements in this area in both later versions of
0.18 and 0.19. Are you configuring minimum master nodes?

On Tue, Apr 10, 2012 at 2:31 PM, MagmaRules mfcoxo@gmail.com wrote:

Hi there,

I'm having a similar problem. The cluster node 3 "liegep03" failed with a
timeout for some reason. After a while it started to retry to connect and
ended up in this state: http://pastie.org/3761184 . I ended up with a
1.2GB log file.

The server was restarted and then it entered a new state (in the pastie
under "Restart"). I ended up shutting down the server.

While the server was up the cluster remained in a yellow state. I shut it
down and the cluster recovered to green state.

Did you figure out what the reason was for your behaviour? I'm using
0.18.4. Has 0.19.2 improved something in this area?

On Friday, January 20, 2012 6:03:22 PM UTC, Grant wrote:

Done. Replied privately.



(sujoysett) #14

Hi,

Sorry to repeat, but we are facing a similar issue with ES version 0.19.3.

We had two elasticsearch nodes on two different physical servers, forming a single cluster. After a network failure, each node responds separately, but they do not re-form a single cluster unless restarted. We followed Shay's suggestion and set discovery.zen.minimum_master_nodes to 2, and it cleanly solved the problem.

But isn't setting minimum_master_nodes a direct contradiction of the distributed concept of ES? After setting it, a network disturbance leads to disconnection of the nodes and makes the cluster state RED, which is the last thing we want.

We want the nodes to respond even during a network outage, at least locally, and join again when the network recovers. We also want the setup to perform, even if in YELLOW state, in case of a node failure (due to any other local issue, heap space, whatever). We also intend to use commons-httpclient failover to redirect requests if any single node fails. Such requirements become impossible once minimum_master_nodes is set.

What is the solution to support both requirements? Any suggestions?

Thanks in advance,
Sujoy.
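(Editor's note: the client-side failover Sujoy describes can be sketched independently of commons-httpclient. This is an illustrative Python version of the same idea, not the thread's actual setup; the node addresses are placeholders. Each node's HTTP endpoint is tried in order, falling through to the next on connection failure:)

```python
# Sketch of client-side failover across several ES HTTP endpoints.
# Illustrative only: addresses are placeholders for real cluster nodes.
import urllib.request
import urllib.error

NODES = ["http://10.180.46.201:9200", "http://10.180.46.203:9200"]

def request_with_failover(path, nodes=NODES, timeout=5):
    """Try each node in order; return the body of the first successful response."""
    last_err = None
    for node in nodes:
        try:
            with urllib.request.urlopen(node + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_err = err  # node unreachable or timed out; try the next one
    raise RuntimeError("all nodes failed: %s" % last_err)
```

Note that this only masks a dead node from the client's point of view; as Sujoy observes, it does not help when minimum_master_nodes has made the surviving side refuse writes.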

On Wednesday, April 11, 2012 5:05:24 PM UTC+5:30, kimchy wrote:

Yes, there have been improvements in this area in both later versions of
0.18 and 0.19. Are you configuring minimum master nodes?



(system) #15