Node evicted itself?


(Mohamed Lrhazi) #1

node5 in my 5-node cluster somehow left the party... its log says:

[root@log-i-4 ~]# tail -100 /data/elasticsearch/logs/elasticsearch.log
[2012-10-15 15:34:51,937][WARN ][monitor.jvm ] [ES5]
[gc][ParNew][92900][19930] duration [3.5m], collections [1]/[3.5m], total
[3.5m]/[6m], memory [807.5mb]->[754.3mb]/[1.9gb], all_pools {[Code Cache]
[6.6mb]->[6.6mb]/[48mb]}{[Par Eden Space] [58.9mb]->[2.3mb]/[66.5mb]}{[Par
Survivor Space] [8.3mb]->[8.3mb]/[8.3mb]}{[CMS Old Gen]
[740.2mb]->[743.6mb]/[1.9gb]}{[CMS Perm Gen] [35.5mb]->[35.5mb]/[82mb]}
[2012-10-15 15:34:52,502][INFO ][discovery.zen ] [ES5]
master_left [[ES1][JQHTswdQT4SDtBV-QYrU7w][inet[/10.212.19.10:9300]]],
reason [do not exists on master, act as master failure]
[2012-10-15 15:34:52,504][INFO ][cluster.service ] [ES5] master
{new [ES5][-a6YMTDKQvWwonzcZ_S4eQ][inet[/10.212.19.14:9300]], previous
[ES1][JQHTswdQT4SDtBV-QYrU7w][inet[/10.212.19.10:9300]]}, removed
{[ES1][JQHTswdQT4SDtBV-QYrU7w][inet[/10.212.19.10:9300]],}, reason:
zen-disco-master_failed
([ES1][JQHTswdQT4SDtBV-QYrU7w][inet[/10.212.19.10:9300]])

On the master all I see is:
[root@log-s-1 es]# tail -100 /data/elasticsearch/logs/elasticsearch.log
[2012-10-15 15:32:50,366][INFO ][cluster.service ] [ES1] removed
{[ES5][-a6YMTDKQvWwonzcZ_S4eQ][inet[/10.212.19.14:9300]],}, reason:
zen-disco-node_failed([ES5][-a6YMTDKQvWwonzcZ_S4eQ][inet[/10.212.19.14:9300]]),
reason failed to ping, tried [3] times, each with maximum [30s] timeout

What happened? And how do I put this node back?

Thanks a lot,
Mohamed.



(Tanguy) #2

According to the logs:

- your master 10.212.19.10 couldn't ping node 10.212.19.14 and removed it
from the cluster;
- your node 10.212.19.14 couldn't ping the master and elected itself as a
new master.

Since the two nodes can't reach each other, you end up with two standalone
clusters (a split brain). To avoid that, check your network, configure
discovery.zen.minimum_master_nodes, or set node.master to false on the
second node.
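As a rough sketch, assuming these settings go in each node's elasticsearch.yml (node counts taken from this thread), Tanguy's two options would look like:

```yaml
# Option 1: require a quorum of master-eligible nodes before electing a master.
# With 5 master-eligible nodes, quorum is (5 / 2) + 1 = 3, so a partitioned
# minority can never elect its own master.
discovery.zen.minimum_master_nodes: 3

# Option 2 (alternative): make a node ineligible for master election entirely,
# so it can never promote itself when it loses sight of the real master.
# node.master: false
```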

-- Tanguy
Twitter: @tlrx



(Mohamed Lrhazi) #3

Thanks Tanguy.
These are VMs on the same virtual VLAN... it's hard to guess what the
network issue would be, and iptables was not modified...

Right now I can ping and telnet to port 9300 in both directions, from node1
to node5 and from node5 to node1. Still, node5 has elected itself master of
a cluster of four that doesn't include node1, and node1 doesn't seem to see
node5.



(Radu Gheorghe) #4

Hi Mohamed,

If you can ping and telnet between the nodes, then your node should be able
to join the cluster if you use unicast discovery (see the ES configuration).
By default Elasticsearch uses multicast discovery, and your VLAN may have
trouble with that.
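A hedged sketch of switching to unicast discovery in elasticsearch.yml, using the node IPs that appear in the logs above (10.212.19.12 is assumed for the fifth node; adjust to your actual hosts):

```yaml
# Disable multicast discovery and list the cluster members explicitly.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts:
  - "10.212.19.10:9300"
  - "10.212.19.11:9300"
  - "10.212.19.12:9300"   # assumed; not visible in the logs above
  - "10.212.19.13:9300"
  - "10.212.19.14:9300"
```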

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene



(Mohamed Lrhazi) #5

Could multicast work when the cluster is first started and then stop
working later? I would imagine that if the VLAN works in the first case, it
should continue to work in the latter, no?

I now remember that this happened to me before, when the master did not
discover all four other nodes and I had to shut everything down and start
over...

What VLAN configuration aspects would be relevant to IP multicast?
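One hedged way to check whether multicast actually traverses the VLAN is to watch for the discovery pings directly. This assumes the pre-1.0 zen discovery defaults (group 224.2.2.4, UDP port 54328) and an eth0 interface; both are assumptions, not confirmed by this thread:

```shell
# On any node: capture zen discovery multicast pings (run as root).
# If packets from the other nodes never show up here, the switch or
# hypervisor is dropping multicast; IGMP snooping enabled on the VLAN
# without an IGMP querier/mrouter is a common culprit.
tcpdump -n -i eth0 udp port 54328 and host 224.2.2.4

# Check which multicast groups this host has actually joined:
netstat -gn
```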

Thanks a lot,
Mohamed.



(Mohamed Lrhazi) #6

So I restarted all my nodes since my last post, I guess 2 days ago, and my
cluster was back to green, 5/5. I had made one change to the config in the
hope that it would prevent this issue from reoccurring:

discovery.zen.minimum_master_nodes: 2
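One thing worth checking: with five master-eligible nodes, the usual quorum formula suggests a higher value than 2. A sketch of the arithmetic:

```yaml
# Quorum = (master-eligible nodes / 2) + 1 = (5 / 2) + 1 = 3.
# With minimum_master_nodes: 2, a 3/2 network split leaves BOTH sides with
# enough nodes to elect a master (3 >= 2 and 2 >= 2), so a split brain is
# still possible. With 3, only the majority side can elect.
discovery.zen.minimum_master_nodes: 3
```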

Now I find that my cluster is again down to 4/5... the master logged:

[2012-10-17 21:12:23,169][INFO ][cluster.service ] [ES1] removed
{[ES5][ZNj5YOQ0QS2Evo-A0WRfYQ][inet[/10.212.19.14:9300]],}, reason:
zen-disco-node_failed([ES5][ZNj5YOQ0QS2Evo-A0WRfYQ][inet[/10.212.19.14:9300]]),
reason failed to ping, tried [3] times, each with maximum [30s] timeout

And ES5 logged:

[2012-10-17 21:14:36,261][INFO ][discovery.zen ] [ES5]
master_left [[ES1][OqnGM7BMT3ywc8a3tzOdcQ][inet[/10.212.19.10:9300]]],
reason [do not exists on master, act as master failure]
[2012-10-17 21:14:36,283][INFO ][cluster.service ] [ES5] master
{new [ES4][625jd-2wTaGK2dEO00pRfg][inet[/10.212.19.13:9300]], previous
[ES1][OqnGM7BMT3ywc8a3tzOdcQ][inet[/10.212.19.10:9300]]}, removed
{[ES1][OqnGM7BMT3ywc8a3tzOdcQ][inet[/10.212.19.10:9300]],}, reason:
zen-disco-master_failed
([ES1][OqnGM7BMT3ywc8a3tzOdcQ][inet[/10.212.19.10:9300]])
[2012-10-17 21:14:37,306][INFO ][discovery.zen ] [ES5]
master_left [[ES4][625jd-2wTaGK2dEO00pRfg][inet[/10.212.19.13:9300]]],
reason [no longer master]
[2012-10-17 21:14:37,306][INFO ][cluster.service ] [ES5] master
{new [ES2][USWc7KTXRnCPwUmgQlZmcw][inet[/10.212.19.11:9300]], previous
[ES4][625jd-2wTaGK2dEO00pRfg][inet[/10.212.19.13:9300]]}, removed
{[ES4][625jd-2wTaGK2dEO00pRfg][inet[/10.212.19.13:9300]],}, reason:
zen-disco-master_failed
([ES4][625jd-2wTaGK2dEO00pRfg][inet[/10.212.19.13:9300]])
[2012-10-17 21:14:38,322][INFO ][discovery.zen ] [ES5]
master_left [[ES2][USWc7KTXRnCPwUmgQlZmcw][inet[/10.212.19.11:9300]]],
reason [no longer master]
[2012-10-17 21:14:38,323][INFO ][cluster.service ] [ES5] master
{new [ES5][ZNj5YOQ0QS2Evo-A0WRfYQ][inet[/10.212.19.14:9300]], previous
[ES2][USWc7KTXRnCPwUmgQlZmcw][inet[/10.212.19.11:9300]]}, removed
{[ES2][USWc7KTXRnCPwUmgQlZmcw][inet[/10.212.19.11:9300]],}, reason:
zen-disco-master_failed
([ES2][USWc7KTXRnCPwUmgQlZmcw][inet[/10.212.19.11:9300]])

What could the problem be?
How do I debug this further?
I like multicast; do I really have to remove it?

Thanks a lot,
Mohamed.


