Cluster is broken

Hello,

I have deployed a 3-node ElasticSearch cluster with unicast discovery, and every two or three days the cluster ends up split into two groups. When I restart the service on all nodes, everything is normal again. I am using 0.19.8.

Here is my config.

Node IP - 178.238.237.239 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]
Node IP - 178.238.237.240 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]
Node IP - 178.238.237.241 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]

I am using a single node in the cluster as the unicast discovery node.

Kindly help.

Praveen

--

Hello!

You haven't provided quite enough information to see what is happening. Check your log files to see what happens when the nodes in your cluster split. If you can provide that information, we will be able to help.

--

Regards,

Rafał Kuć

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch


--

Hello Rafal,

Here is the log from the three nodes. TES2 could not rejoin the cluster.

TES1:

[2012-08-21 21:10:57,668][INFO ][cluster.service ] [TES1] removed
{[TES2][bM_VnK8XRT-AnfslVn--Lg][inet[/178.238.237.239:9300]],}, reason:
zen-disco-node_failed([TES2][bM_VnK8XRT-AnfslVn--Lg][inet[/178.238.237.239:9300]]),
reason failed to ping, tried [3] times, each with maximum [30s] timeout

TES2:

[2012-08-21 21:11:04,392][INFO ][discovery.zen ] [TES2]
master_left [[TES1][OpmSifs_QeiChwguUzeU2A][inet[/178.238.237.241:9300]]],
reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-08-21 21:11:04,412][INFO ][cluster.service ] [TES2] master
{new [TES3][VG7KwKhEQ5u2cyug9hLJOw][inet[/178.238.237.240:9300]], previous
[TES1][OpmSifs_QeiChwguUzeU2A][inet[/178.238.237.241:9300]]}, removed
{[TES1][OpmSifs_QeiChwguUzeU2A][inet[/178.238.237.241:9300]],}, reason:
zen-disco-master_failed
([TES1][OpmSifs_QeiChwguUzeU2A][inet[/178.238.237.241:9300]])
[2012-08-21 21:11:05,596][INFO ][discovery.zen ] [TES2]
master_left [[TES3][VG7KwKhEQ5u2cyug9hLJOw][inet[/178.238.237.240:9300]]],
reason [no longer master]
[2012-08-21 21:11:05,597][INFO ][cluster.service ] [TES2] master
{new [TES2][bM_VnK8XRT-AnfslVn--Lg][inet[/178.238.237.239:9300]], previous
[TES3][VG7KwKhEQ5u2cyug9hLJOw][inet[/178.238.237.240:9300]]}, removed
{[TES3][VG7KwKhEQ5u2cyug9hLJOw][inet[/178.238.237.240:9300]],}, reason:
zen-disco-master_failed
([TES3][VG7KwKhEQ5u2cyug9hLJOw][inet[/178.238.237.240:9300]])

TES3:

[2012-08-21 21:11:17,016][INFO ][cluster.service ] [TES3] removed
{[TES2][bM_VnK8XRT-AnfslVn--Lg][inet[/178.238.237.239:9300]],}, reason:
zen-disco-receive(from master
[[TES1][OpmSifs_QeiChwguUzeU2A][inet[/178.238.237.241:9300]]])

Thank you.

Praveen


--

Why don't you set all nodes to discover all others?

Something like:
For node 241, add unicast to 239 and 240
For node 240, add unicast to 239 and 241
For node 239, add unicast to 240 and 241

Perhaps it could help avoid split-brain issues.
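In elasticsearch.yml terms, that suggestion would look something like the following sketch. The multicast line is my assumption (it is commonly disabled alongside unicast discovery), not something from the thread:

```yaml
# On node 178.238.237.239 -- list the other two nodes as unicast peers
discovery.zen.ping.multicast.enabled: false   # assumption: disable multicast when relying on unicast
discovery.zen.ping.unicast.hosts: ["178.238.237.240:9300", "178.238.237.241:9300"]

# On node 178.238.237.240
discovery.zen.ping.unicast.hosts: ["178.238.237.239:9300", "178.238.237.241:9300"]

# On node 178.238.237.241
discovery.zen.ping.unicast.hosts: ["178.238.237.239:9300", "178.238.237.240:9300"]
```

With a full mesh of peers, a node that loses sight of one host can still ping the other, rather than depending on a single discovery node being reachable.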

HTH

David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


--

From the logs...

It appears that the node which loses its connection to the others gives up after a minute or two and does not retry joining the cluster. Most network partitions last at least a few minutes, and connectivity is usually regained soon after, if not within a minute or two.


--

In many gossip-based distributed systems, every node checks every N seconds whether the failed nodes are back up, so that they can rejoin the cluster, and vice versa.

Is such a mechanism available on ElasticSearch?

I am also running Cassandra and Riak on the same cluster. They function
normally and recover gracefully from network splits.


--

@David - I have modified the configuration so that each node lists the other
two nodes for unicast discovery.

I have also set discovery.zen.minimum_master_nodes to 2 so that, in case of a
network split, a node that is left alone will not continue to function on its own.

Will setting discovery.zen.fd.ping_retries to more than 3 help?
What are the implications?

The end goal is to enable the cluster to recover after an arbitrary network
split.
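For context, the failure-detection settings in question live under discovery.zen.fd. A sketch of the relevant elasticsearch.yml fragment, with the defaults that match the log messages above ("tried [3] times, each with maximum [30s] timeout"):

```yaml
discovery.zen.minimum_master_nodes: 2   # majority of 3 master-eligible nodes
discovery.zen.fd.ping_retries: 3        # default; matches "tried [3] times" in the logs
discovery.zen.fd.ping_timeout: 30s      # default; matches "maximum [30s] timeout"
discovery.zen.fd.ping_interval: 1s      # default interval between failure-detection pings
```

The main implication of raising ping_retries is that a genuinely dead node takes proportionally longer to be detected and removed from the cluster.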

Please help.

Praveen


--

Praveen Baratam wrote:

Hello Rafal,

Here is the log from the three nodes. TES2 could not rejoin the cluster.

[...]

Here is my config.

Node IP - 178.238.237.239 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]
Node IP - 178.238.237.240 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]
Node IP - 178.238.237.241 - discovery.zen.ping.unicast.hosts: ["178.238.237.241:9300"]

This last one doesn't make sense (why have it discover itself?),
but it shouldn't hurt.

I am using a single node in the cluster as Unicast discovery node.

If you're not having problems immediately upon cluster startup then I
doubt it's your discovery settings.

Try turning on DEBUG logging so you can at least see when the ping
timeouts start. That can give you an idea of whether you have a network
issue, or perhaps a correlation with periods of heavy indexing, which can
cause netty to back up and stop responding to pings. The default ping
settings require very responsive nodes, so you may find you need to
raise discovery.zen.ping_timeout to 30s or 60s.

You can also try setting discovery.zen.minimum_master_nodes to 2 to
see if it makes it less susceptible to split-brain.
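Put together, the two suggestions would be a fragment like this (a sketch, not a tested recommendation; the logging snippet assumes the stock logging.yml layout):

```yaml
# elasticsearch.yml
discovery.zen.ping_timeout: 30s         # initial discovery ping; the default is much shorter
discovery.zen.minimum_master_nodes: 2   # quorum for a 3-node cluster

# logging.yml (separate file) -- enable discovery debug logging
# logger:
#   discovery: DEBUG
```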

http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html

-Drew

--

The problem seems to have vanished after increasing minimum_master_nodes
to 2 and the ping timeout to 30s.

Moreover, I have listed the other two nodes as unicast peers on each node,
as suggested by David.
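For anyone who finds this thread later, the per-node configuration described above would look roughly like this (a sketch reconstructed from the thread; the cluster name is hypothetical, and the peer list shown is the one for the .239 node):

```yaml
cluster.name: es-cluster                 # hypothetical name, not from the thread
discovery.zen.minimum_master_nodes: 2    # majority of 3 master-eligible nodes
discovery.zen.ping_timeout: 30s          # raised from the default
discovery.zen.ping.unicast.hosts: ["178.238.237.240:9300", "178.238.237.241:9300"]
```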

Thank you.


--

Hey David,

Can you please explain how unicast can avoid the split-brain issue? I am
about to deploy a cluster of ElasticSearch machines but have encountered
the split-brain issue a few times in testing. I am wondering how to avoid it
in the future.

Thanks!
Vinay


--