Nice try!
On Node 1:
root@es_node1:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
184: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether a2:f3:84:d1:43:64 brd ff:ff:ff:ff:ff:ff
On Node 2:
root@es_node2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
182: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether 26:ac:8b:14:58:ac brd ff:ff:ff:ff:ff:ff
On the host:
84: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
mode DEFAULT group default
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
183: veth9bZZVH: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
185: vethIG4AFA: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:be:16:bf:53:4f brd ff:ff:ff:ff:ff:ff
187: veth_logstash: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:93:8a:74:89:5a brd ff:ff:ff:ff:ff:ff
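As an aside, the switch to unicast discovery mentioned further down the thread would look roughly like this in elasticsearch.yml (a sketch for 0.90-era zen discovery; the host list is built from the two node addresses shown in this thread):

```yaml
# Disable multicast discovery and list the cluster nodes explicitly.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.16.0.100", "172.16.0.101"]
```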
On Thursday, March 13, 2014 at 16:47:37 UTC+1, Jörg Prante wrote:
Enter
ip addr show
or
ifconfig
and check if MULTICAST is configured on the interface.
Jörg
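The check Jörg suggests can be scripted, e.g. (a sketch; the interface name eth0 is an assumption and can be passed as an argument):

```shell
#!/bin/sh
# Report whether the MULTICAST flag is set on an interface (default: eth0).
IFACE="${1:-eth0}"
if ip link show "$IFACE" 2>/dev/null | grep -q MULTICAST; then
    MSG="$IFACE: MULTICAST flag is set"
else
    MSG="$IFACE: MULTICAST flag missing, or interface not found"
fi
echo "$MSG"
```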
On Thu, Mar 13, 2014 at 4:29 PM, Guillaume Loetscher <ster...@gmail.com> wrote:
Definitely a multicast problem.
I've decided to switch to unicast, and I can now shut down any node (elected master or not), and the remaining one takes over the load perfectly.
When the other node comes back online using unicast discovery, there's no problem: the elected master discovers the other node and adds it to the cluster.
I don't know what was failing in my (virtual) network configuration, but honestly, I can't lose several more hours pinpointing where my mockup failed for multicast.
On Thursday, March 13, 2014 at 15:23:06 UTC+1, Guillaume Loetscher wrote:
@Xiao Yu: nope, that's not working either.
@Clinton Gormley: Yes, just after the "no matching id" error, a telnet from Node 1 to Node 2 is possible, and I get a valid connection.
All, please remember that after such an issue, if I manually stop the service on Node 2 and then restart it, it manages to reach the cluster without a problem.
I'm suspecting a "race condition" here, something like "the Node 2 container boots so fast that the bridge is not ready to handle the multicast packet, leading to a connection problem".
On Thursday, March 13, 2014 at 14:00:40 UTC+1, Xiao Yu wrote:
Total shot in the dark here, but try taking the hash mark out of the node names and see if that helps?
On Thursday, March 13, 2014 at 5:31:30 AM UTC-4, Guillaume Loetscher wrote:
Sure.
Node #1:
root@es_node1:~# grep -E '[1]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #1"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s
Node #2:
root@es_node2:~# grep -E '[2]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s
On Thursday, March 13, 2014 at 10:15:16 UTC+1, David Pilato wrote:
Did you set the same cluster name on both nodes?
--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr
On March 13, 2014 at 09:57:35, Guillaume Loetscher (ster...@gmail.com) wrote:
Hi,
First, thanks for the answers and remarks.
You are both right: the issue I'm currently facing leads to a "split-brain" situation, where Node #1 and Node #2 are both master, each living its own life on its side. I'll change my configuration and the number of nodes to limit this situation (I already read this article about split-brain in ES: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/).
However, this split-brain situation is the result of the discovery/broadcast problem, which shows up in the log of Node #2 here:
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node
ES #2] received ping response ping_response{target [[Node ES
#1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}],
master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][in
et[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]} with
no matching id [1]
So, the connectivity between Node #1 (which is the first one online, and therefore master) and Node #2 is established, as the log on Node #2 clearly says "received ping response", but with an "ID that didn't match".
This is apparently why Node #2 didn't join the cluster on Node #1, and this is the specific issue I want to resolve.
Thanks,
On Thursday, March 13, 2014 at 07:03:35 UTC+1, David Pilato wrote:
Bonjour
You should set min_master_nodes to 2, although I'd recommend having 3 nodes instead of 2.
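In elasticsearch.yml (0.90-era zen discovery) that would be (a minimal sketch; with two master-eligible nodes, a quorum of 2 keeps a lone node from electing itself):

```yaml
# Require a quorum of 2 master-eligible nodes before electing a master.
discovery.zen.minimum_master_nodes: 2
```

The trade-off: with only two nodes, the cluster cannot elect a master if either one goes down, which is why three nodes are recommended.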
David
Twitter: @dadoonet / @elasticsearchfr / @scrutmydocs
On March 12, 2014 at 23:58, Guillaume Loetscher <ster...@gmail.com> wrote:
Hi,
I've begun testing Elasticsearch recently, on a little mockup I've designed.
Currently, I'm running two nodes in two LXC (v0.9) containers. The containers are linked with veth interfaces to a bridge declared on the host.
When I start the first node, the cluster starts, but when I start the second node a bit later, it seems to get some information from the other node, yet it always ends with the same "no matching id" error.
Here's what I'm doing:
I start the LXC container of the first node :
root@lada:~# date && lxc-start -n es_node1 -d
mercredi 12 mars 2014, 22:59:39 (UTC+0100)
I log on to the node and check the log file:
[2014-03-12 21:59:41,927][INFO ][node ] [Node
ES #1] version[0.90.12], pid[1129], build[26feed7/2014-02-25T15:
38:23Z]
[2014-03-12 21:59:41,928][INFO ][node ] [Node ES #1]
initializing ...
[2014-03-12 21:59:41,944][INFO ][plugins ] [Node ES #1]
loaded , sites
[2014-03-12 21:59:47,262][INFO ][node ] [Node ES #1]
initialized
[2014-03-12 21:59:47,263][INFO ][node ] [Node ES #1]
starting ...
[2014-03-12 21:59:47,485][INFO ][transport ] [Node ES #1]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
172.16.0.100:9300]}
[2014-03-12 21:59:57,573][INFO ][cluster.service ] [Node ES #1]
new_master [Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][
inet[/172.16.0.100:9300]]{master=true}, reason: zen-disco-join
(elected_as_master)
[2014-03-12 21:59:57,657][INFO ][discovery ] [Node ES #1]
logstash/LbMQazWXR9uB6Q7R2xLxGQ
[2014-03-12 21:59:57,733][INFO ][http ] [Node ES #1]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
172.16.0.100:9200]}
[2014-03-12 21:59:57,735][INFO ][node ] [Node ES #1]
started
[2014-03-12 21:59:59,569][INFO ][gateway ] [Node ES #1]
recovered [2] indices into cluster_state
Then I start the second node:
root@lada:/var/lib/lxc/kibana# date && lxc-start -n es_node2 -d
mercredi 12 mars 2014, 23:02:59 (UTC+0100)
I log on to the second node and open the log:
[2014-03-12 22:03:02,126][INFO ][node ] [Node
ES #2] version[0.90.12], pid[1128], build[26feed7/2014-02-25T15:
38:23Z]
[2014-03-12 22:03:02,127][INFO ][node ] [Node ES #2]
initializing ...
[2014-03-12 22:03:02,141][INFO ][plugins ] [Node ES #2]
loaded , sites
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2]
initialized
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2]
starting ...
[2014-03-12 22:03:07,557][INFO ][transport ] [Node ES #2]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
172.16.0.101:9300]}
[2014-03-12 22:03:17,637][INFO ][cluster.service ] [Node ES #2]
new_master [Node ES #2][0nNCsZrFS6y95G1ld-v_rA][
inet[/172.16.0.101:9300]]{master=true}, reason: zen-disco-join
(elected_as_master)
[2014-03-12 22:03:17,718][INFO ][discovery ] [Node ES #2]
logstash/0nNCsZrFS6y95G1ld-v_rA
[2014-03-12 22:03:17,783][INFO ][http ] [Node ES #2]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
172.16.0.101:9200]}
[2014-03-12 22:03:17,785][INFO ][node ] [Node ES #2]
started
[2014-03-12 22:03:19,550][INFO ][gateway ] [Node ES #2]
recovered [2] indices into cluster_state
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node
ES #2] received ping response ping_response{target [[Node ES
#1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}],
master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][
inet[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]}
with no matching id [1]
At that point, each node considers itself master.
Here's my configuration for each node (the same on Node 1, except for node.name):
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s
The bridge on my host is set up to start forwarding immediately on every new interface, so I don't think the problem is there. Here's the bridge config:
auto br1
iface br1 inet static
address 172.16.0.254
netmask 255.255.255.0
bridge_ports regex veth_.*
bridge_stp off
bridge_maxwait 0
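One thing worth checking on a setup like this (a sketch; the bridge name br1 matches the config above): Linux bridges can drop multicast traffic when IGMP/multicast snooping is enabled and no querier is present, which would explain discovery pings never reaching the other container.

```shell
#!/bin/sh
# Check whether multicast snooping is enabled on the bridge (name br1 is an
# assumption). With snooping on and no IGMP querier, a Linux bridge may stop
# flooding multicast frames to all ports, so discovery pings never arrive.
SNOOP="/sys/class/net/br1/bridge/multicast_snooping"
if [ -f "$SNOOP" ]; then
    MSG="br1 multicast_snooping=$(cat "$SNOOP")"
    # 0 = flood multicast to every bridge port (requires root).
    echo 0 > "$SNOOP" 2>/dev/null || MSG="$MSG (need root to change it)"
else
    MSG="bridge br1 not found on this host"
fi
echo "$MSG"
```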
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d86aec61-e851-48ac-a7fb-fae757f3eebe%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.