Node not joining cluster on boot


(Guillaume Loetscher) #1

Hi,

I've begun to test Elasticsearch recently, on a little mockup I've designed.

Currently, I'm running two nodes on two LXC (v0.9) containers. Those
containers are linked using veth to a bridge declared on the host.

When I start the first node, the cluster starts, but when I start the
second node a bit later, it seems to get some information from the other
node, yet it always ends with the same "no matching id" error.

Here's what I'm doing:

I start the LXC container of the first node:
root@lada:~# date && lxc-start -n es_node1 -d
Wednesday 12 March 2014, 22:59:39 (UTC+0100)

I log on to the node and check the log file:
[2014-03-12 21:59:41,927][INFO ][node ] [Node ES #1] version[0.90.12], pid[1129], build[26feed7/2014-02-25T15:38:23Z]
[2014-03-12 21:59:41,928][INFO ][node ] [Node ES #1] initializing ...
[2014-03-12 21:59:41,944][INFO ][plugins ] [Node ES #1] loaded [], sites []
[2014-03-12 21:59:47,262][INFO ][node ] [Node ES #1] initialized
[2014-03-12 21:59:47,263][INFO ][node ] [Node ES #1] starting ...
[2014-03-12 21:59:47,485][INFO ][transport ] [Node ES #1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.16.0.100:9300]}
[2014-03-12 21:59:57,573][INFO ][cluster.service ] [Node ES #1] new_master [Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}, reason: zen-disco-join (elected_as_master)
[2014-03-12 21:59:57,657][INFO ][discovery ] [Node ES #1] logstash/LbMQazWXR9uB6Q7R2xLxGQ
[2014-03-12 21:59:57,733][INFO ][http ] [Node ES #1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.16.0.100:9200]}
[2014-03-12 21:59:57,735][INFO ][node ] [Node ES #1] started
[2014-03-12 21:59:59,569][INFO ][gateway ] [Node ES #1] recovered [2] indices into cluster_state

Then I start the second node:
root@lada:/var/lib/lxc/kibana# date && lxc-start -n es_node2 -d
Wednesday 12 March 2014, 23:02:59 (UTC+0100)

I log on to the second node and open the log:
[2014-03-12 22:03:02,126][INFO ][node ] [Node ES #2] version[0.90.12], pid[1128], build[26feed7/2014-02-25T15:38:23Z]
[2014-03-12 22:03:02,127][INFO ][node ] [Node ES #2] initializing ...
[2014-03-12 22:03:02,141][INFO ][plugins ] [Node ES #2] loaded [], sites []
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2] initialized
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2] starting ...
[2014-03-12 22:03:07,557][INFO ][transport ] [Node ES #2] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.16.0.101:9300]}
[2014-03-12 22:03:17,637][INFO ][cluster.service ] [Node ES #2] new_master [Node ES #2][0nNCsZrFS6y95G1ld-v_rA][inet[/172.16.0.101:9300]]{master=true}, reason: zen-disco-join (elected_as_master)
[2014-03-12 22:03:17,718][INFO ][discovery ] [Node ES #2] logstash/0nNCsZrFS6y95G1ld-v_rA
[2014-03-12 22:03:17,783][INFO ][http ] [Node ES #2] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/172.16.0.101:9200]}
[2014-03-12 22:03:17,785][INFO ][node ] [Node ES #2] started
[2014-03-12 22:03:19,550][INFO ][gateway ] [Node ES #2] recovered [2] indices into cluster_state
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node ES #2] received ping response ping_response{target [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]} with no matching id [1]

At that point, each node considers itself master.

Here's my configuration for each node (it's the same on node 1, except for
node.name):
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

The bridge on my host is set up to forward every new interface immediately,
so I don't think the problem is there. Here's the bridge config:
auto br1
iface br1 inet static
address 172.16.0.254
netmask 255.255.255.0
bridge_ports regex veth_.*
bridge_spt off
bridge_maxwait 0

The network configuration on each container is the same (IP aside). Here's
node #1's:
root@es_node1:~# cat /etc/network/interfaces
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
address 172.16.0.100
netmask 255.255.255.0
gateway 172.16.0.254

Node #2 is identical, except for IP 172.16.0.101

Elasticsearch version:
root@es_node1:~# dpkg -l elasticsearch
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================-==============-==============-============================================
ii elasticsearch 0.90.12 all Open Source, Distributed, RESTful Search Eng

Host OS version:
root@lada:~# uname -a
Linux lada 3.12-1-amd64 #1 SMP Debian 3.12.6-2 (2013-12-29) x86_64 GNU/Linux
root@lada:~# cat /etc/debian_version
jessie/sid

LXC information:
root@lada:~# dpkg -l "lxc"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================-================-================-==================================================
ii lxc 0.9.0~alpha3-2+d amd64 Linux Containers userspace tools

LXC container OS: Debian stable 7.4

If I stop the Elasticsearch service on Node #2 and then restart it, it
manages to join the cluster. However, a node failing to join the cluster at
server reboot is a big problem for me, and is absolutely not normal.

Does someone have a clue about what's going on?

Thanks a lot for your help.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b58fed0b-45ca-4eea-af9f-580b848011ac%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Xiao Yu) #2

Sounds like you have a standard split-brain problem. The best way to solve
it is to set discovery.zen.minimum_master_nodes to 2 for your cluster, so
that both nodes must be up to elect a single master. This does mean your
cluster will not function with just one node.
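A minimal sketch of that setting, as it would go in each node's elasticsearch.yml (the cluster name is taken from this thread; the rest of the posted config stays as-is):

```yaml
# elasticsearch.yml -- same on both nodes
cluster.name: logstash
# Require a strict majority of master-eligible nodes (2 of 2 here)
# before a master can be elected, so a lone node cannot elect itself.
discovery.zen.minimum_master_nodes: 2
```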



(David Pilato) #3

Bonjour :slight_smile:

You should set discovery.zen.minimum_master_nodes to 2, although I'd recommend having 3 nodes instead of 2.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
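The numbers here follow the usual quorum rule, a strict majority of master-eligible nodes; a quick sketch of the arithmetic (the function name is mine, not an Elasticsearch API):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority (quorum) of master-eligible nodes."""
    return master_eligible // 2 + 1

# 2 nodes -> quorum 2: the cluster cannot run with either node down.
# 3 nodes -> quorum 2: one node can be down, and no split-brain.
print(minimum_master_nodes(2), minimum_master_nodes(3))  # -> 2 2
```

This is why a 3-node cluster is more comfortable than 2: the quorum is still 2, so it tolerates one node being down.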



(Guillaume Loetscher) #4

Hi,

First, thanks for the answers and remarks.

You are both right: the issue I'm currently facing leads to a "split-brain"
situation, where Node #1 and Node #2 are both master, each living its own
life on its side. I'll look at changing my configuration and the number of
nodes in order to limit this situation (I've already read this article about
split-brain in ES: http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
).

However, this split-brain situation is the result of the problem with
discovery / multicast, which shows up in the log of Node #2 here:
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node ES #2] received ping response ping_response{target [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]} with no matching id [1]

So connectivity between Node #1 (the first one online, and therefore master)
and Node #2 is established: the log on Node #2 clearly says "received ping
response", but with an ID that didn't match.

This is apparently why Node #2 didn't join the cluster on Node #1, and this
is the specific issue I want to resolve.

Thanks,



(David Pilato) #5

Did you set the same cluster name on both nodes?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr



(Guillaume Loetscher) #6

Sure

Node #1:
root@es_node1:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #1"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

Node #2:
root@es_node2:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

Le jeudi 13 mars 2014 10:15:16 UTC+1, David Pilato a écrit :

did you set the same cluster name on both nodes?



(Clinton Gormley) #7

Can you telnet from each box to port 9300 on the other box?

Does your bridge support multicast? If not, you could use unicast instead.

clint



(Xiao Yu) #8

Total shot in the dark here, but try taking the hash mark out of the node
names and see if that helps?



(Guillaume Loetscher) #9

@Xiao Yu: nope, removing the hash mark doesn't help either.

@Clinton Gormley: yes, just after the "no matching id" error, I can telnet
from Node 1 to Node 2 and get a valid connection.

Also, keep in mind that after the issue occurs, if I manually stop the
service on Node 2 and then restart it, the node joins the cluster without
any problem.

I suspect a race condition here, something like "the Node 2 container boots
so fast that the bridge is not ready to handle the multicast packets,
leading to a discovery problem".
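If the bridge is the culprit, one host-side knob worth ruling out (an assumption on my part, not something verified in this thread) is the bridge's forwarding delay together with IGMP snooping, which can silently drop multicast on ports the kernel hasn't learned yet. In /etc/network/interfaces terms that might look like:

```
auto br1
iface br1 inet static
    address 172.16.0.254
    netmask 255.255.255.0
    bridge_ports regex veth_.*
    bridge_stp off
    bridge_fd 0
    bridge_maxwait 0
    # Hypothetical: disable IGMP snooping so multicast discovery pings
    # are flooded to all bridge ports, including freshly attached veths.
    post-up echo 0 > /sys/class/net/br1/bridge/multicast_snooping
```

The `multicast_snooping` sysfs file only exists on kernels built with bridge IGMP snooping support, so treat the `post-up` line as a sketch to adapt.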



(Guillaume Loetscher) #10

Definitely a multicast problem.

I've switched to unicast discovery, and now I can shut down either node
(elected master or not) and the remaining one picks up the load perfectly.

When the other node comes back online, unicast discovery works fine: the
elected master discovers the returning master-eligible node and adds it to
the cluster.

I don't know what was failing in my (virtual) network configuration, but
honestly I can't spend several more hours pinpointing why multicast failed
on my mockup.
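For reference, on 0.90.x the switch to unicast amounts to something like this in each node's elasticsearch.yml (a sketch; the host list reuses the addresses from this thread, and assumes the default transport port 9300):

```yaml
# Disable multicast discovery and list the peers explicitly.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["172.16.0.100", "172.16.0.101"]
```

Each node can safely list itself in the host list, so the same file works on both nodes.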



(Jörg Prante) #11

Enter

ip addr show

or

ifconfig

and check if MULTICAST is configured on the interface.

Jörg
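
A non-interactive version of that check, if you want to script it (the
sample line is an example of what `ip link` prints, not output from these
nodes; on a live node pipe `ip link show eth0` instead of the here-doc):

```shell
# Succeeds and prints a message if the MULTICAST flag is set.
grep -q MULTICAST <<'EOF' && echo "MULTICAST flag present"
184: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000
EOF
```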



(Guillaume Loetscher) #12

Nice try :wink:

On node 1 :
root@es_node1:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
184: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether a2:f3:84:d1:43:64 brd ff:ff:ff:ff:ff:ff

On Node 2 :
root@es_node2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
182: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether 26:ac:8b:14:58:ac brd ff:ff:ff:ff:ff:ff

On host :

84: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
mode DEFAULT group default
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
183: veth9bZZVH: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
185: vethIG4AFA: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:be:16:bf:53:4f brd ff:ff:ff:ff:ff:ff
187: veth_logstash: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:93:8a:74:89:5a brd ff:ff:ff:ff:ff:ff

Le jeudi 13 mars 2014 16:47:37 UTC+1, Jörg Prante a écrit :

Enter

ip addr show

or

ifconfig

and check if MULTICAST is configured on the interface.

Jörg

On Thu, Mar 13, 2014 at 4:29 PM, Guillaume Loetscher <ster...@gmail.com<javascript:>

wrote:

Definitely a multicast problem.

I've decided to switch to unicast, and I manage to shutdown any nodes
(elected master or not), and the remaining one is taking up the load
perfectly.

When the other node is getting back online, using the unicast discovery,
there's no problem, the elected master discovered another master node, and
add it in the cluster.

I don't know what was failing in my (virtual) network configuration, but
honestly, I cannot lost several more hours to point out where my mockup
failed for multicast.

Le jeudi 13 mars 2014 15:23:06 UTC+1, Guillaume Loetscher a écrit :

@Xiao Yu : nope, it's not working also.

@Clinton Gormley : Yes, just after the "no matching id" error, a telnet
from Node 1 to node 2 is possible, and I got a valid connection.

All, please reming that after such issue, if I manually stop the service
on Node 2, then restart it, it will manage to reach the cluster without a
problem.

I'm suspecting a "race condition" here, something like "Node 2 container
is booting so fast that the bridge is not ready to handle the multicast
packet, leading to a connection problem".

Le jeudi 13 mars 2014 14:00:40 UTC+1, Xiao Yu a écrit :

Total shot in the dark here but try taking the hashmark out of the node
names and see if that helps?

On Thursday, March 13, 2014 5:31:30 AM UTC-4, Guillaume Loetscher wrote:

Sure

Node # 1:
root@es_node1:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #1"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

Node #2 :
root@es_node2:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

Le jeudi 13 mars 2014 10:15:16 UTC+1, David Pilato a écrit :

did you set the same cluster name on both nodes?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfrhttps://twitter.com/elasticsearchfr

Le 13 mars 2014 à 09:57:35, Guillaume Loetscher (ster...@gmail.com)
a écrit:

Hi,

First, thanks for the answers and remarks.

You are both right, the issue I'm currently facing leads to a
"split-brain" situation, where Node #1 & Node #2 are both master, and doing
their own life on their side. I'll see to change my configuration and the
number of node, in order to limit this situation (I already checked this
article talking about split-brain in EShttp://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
).

However, this split-brain situation is the result of the problem with
the discovery / broadcast, which is represented in the log of Node #2 here :
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node
ES #2] received ping response ping_response{target [[Node ES
#1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}],
master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][in
et[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]} with
no matching id [1]

So, the connectivity between Node #1 (which is the first one online,
and therefore master) and Node #2 is established, as the log on Node #2
clearly said "received ping response", but with an "ID that didn't match".

This is apparently why Node #2 didn't join the cluster on Node #1,
and this is this specific issue I want to resolve.

Thanks,

Le jeudi 13 mars 2014 07:03:35 UTC+1, David Pilato a écrit :

Bonjour :slight_smile:

You should set min_master_nodes to 2. Although I'd recommend having
3 nodes instead of 2.

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Le 12 mars 2014 à 23:58, Guillaume Loetscher ster...@gmail.com a
écrit :

Hi,

I've begun to test Elasticsearch recently, on a little mockup I've
designed.

Currently, I'm running two nodes on two LXC (v0.9) containers. Those
containers are linked using veth to a bridge declared on the host.

When I start the first node, the cluster starts, but when I start
the second node a bit later, it seems to get some information from the
other node but it always ended with the same "no matchind id" error.

Here's what I'm doing :

I start the LXC container of the first node :
root@lada:~# date && lxc-start -n es_node1 -d
mercredi 12 mars 2014, 22:59:39 (UTC+0100)

I logon the node, check the log file :
[2014-03-12 21:59:41,927][INFO ][node ] [Node
ES #1] version[0.90.12], pid[1129], build[26feed7/2014-02-25T15:
38:23Z]
[2014-03-12 21:59:41,928][INFO ][node ] [Node ES #1]
initializing ...
[2014-03-12 21:59:41,944][INFO ][plugins ] [Node ES #1]
loaded [], sites []
[2014-03-12 21:59:47,262][INFO ][node ] [Node ES #1]
initialized
[2014-03-12 21:59:47,263][INFO ][node ] [Node ES #1]
starting ...
[2014-03-12 21:59:47,485][INFO ][transport ] [Node ES #1]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
172.16.0.100:9300]}
[2014-03-12 21:59:57,573][INFO ][cluster.service ] [Node ES #1]
new_master [Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][
inet[/172.16.0.100:9300]]{master=true}, reason: zen-disco-join
(elected_as_master)
[2014-03-12 21:59:57,657][INFO ][discovery ] [Node ES #1]
logstash/LbMQazWXR9uB6Q7R2xLxGQ
[2014-03-12 21:59:57,733][INFO ][http ] [Node ES #1]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
172.16.0.100:9200]}
[2014-03-12 21:59:57,735][INFO ][node ] [Node ES #1]
started
[2014-03-12 21:59:59,569][INFO ][gateway ] [Node ES #1]
recovered [2] indices into cluster_state

Then I start the second node :
root@lada:/var/lib/lxc/kibana# date && lxc-start -n es_node2 -d
mercredi 12 mars 2014, 23:02:59 (UTC+0100)

Logon on the second node, and open the log :
[2014-03-12 22:03:02,126][INFO ][node ] [Node
ES #2] version[0.90.12], pid[1128], build[26feed7/2014-02-25T15:
38:23Z]
[2014-03-12 22:03:02,127][INFO ][node ] [Node ES #2]
initializing ...
[2014-03-12 22:03:02,141][INFO ][plugins ] [Node ES #2]
loaded [], sites []
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2]
initialized
[2014-03-12 22:03:07,352][INFO ][node ] [Node ES #2]
starting ...
[2014-03-12 22:03:07,557][INFO ][transport ] [Node ES #2]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/
172.16.0.101:9300]}
[2014-03-12 22:03:17,637][INFO ][cluster.service ] [Node ES #2]
new_master [Node ES #2][0nNCsZrFS6y95G1ld-v_rA][
inet[/172.16.0.101:9300]]{master=true}, reason: zen-disco-join
(elected_as_master)
[2014-03-12 22:03:17,718][INFO ][discovery ] [Node ES #2]
logstash/0nNCsZrFS6y95G1ld-v_rA
[2014-03-12 22:03:17,783][INFO ][http ] [Node ES #2]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/
172.16.0.101:9200]}
[2014-03-12 22:03:17,785][INFO ][node ] [Node ES #2]
started
[2014-03-12 22:03:19,550][INFO ][gateway ] [Node ES #2]
recovered [2] indices into cluster_state
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node
ES #2] received ping response ping_response{target [[Node ES
#1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}],
master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][
inet[/172.16.0.100:9300]]{master=true}], cluster_name[logstash]}
with no matching id [1]

At that point, each node considered themselves as master.

Here's my configuration for each node (same for node 1, except the
node.name) :
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

The bridge on my host is setup to forward immediately every new
interfaces so I don't think the problem is here. Here's the bridge config :
auto br1
iface br1 inet static
address 172.16.0.254
netmask 255.255.255.0
bridge_ports regex veth_.*
bridge_spt off
bridge_maxwait 0<span style="color: #000;" cl

...

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/
msgid/elasticsearch/d86aec61-e851-48ac-a7fb-fae757f3eebe%
40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/d86aec61-e851-48ac-a7fb-fae757f3eebe%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/212eb684-a2c6-4901-a7a1-7c51ad1e656b%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/212eb684-a2c6-4901-a7a1-7c51ad1e656b%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/577d3510-0a8a-4bcb-a35f-346d360fcac3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Guillaume Loetscher) #13

OK, I may have solved my problem.

I did a quick check on my configuration, and it appears that, for some
reason, LXC had assigned the same MAC address to the bridge and to one
container's veth interface. See "br1" and "veth9bZZVH" below:

84: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
mode DEFAULT group default
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
183: veth9bZZVH: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
185: vethIG4AFA: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:be:16:bf:53:4f brd ff:ff:ff:ff:ff:ff
187: veth_logstash: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:93:8a:74:89:5a brd ff:ff:ff:ff:ff:ff
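
The clash can also be spotted mechanically: extract every MAC from the
listing and print any that occurs more than once. The here-doc reuses the
host's lines above; the live `ip -o link` variant in the comment is a
sketch on my part:

```shell
# Live equivalent (untested here):
#   ip -o link | grep -o 'link/ether [0-9a-f:]*' | awk '{print $2}' | sort | uniq -d
awk '$1 == "link/ether" {print $2}' <<'EOF' | sort | uniq -d
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
link/ether fe:be:16:bf:53:4f brd ff:ff:ff:ff:ff:ff
link/ether fe:93:8a:74:89:5a brd ff:ff:ff:ff:ff:ff
EOF
# → fe:74:89:32:c8:30
```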

I manually set a fixed MAC address (and name) for each container's veth
interface, and restarted all the nodes.
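
For anyone hitting the same thing, this is roughly what pinning the veth
name and MAC looks like in an LXC 0.9 container config; the pair name and
MAC below are made-up examples, not the values I actually used:

```
# Hypothetical excerpt from /var/lib/lxc/es_node1/config
lxc.network.type = veth
lxc.network.link = br1                  # attach to the host bridge
lxc.network.veth.pair = veth_es_node1   # fixed host-side interface name
lxc.network.hwaddr = 00:16:3e:00:00:01  # fixed, unique MAC
```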

Guess what? Multicast discovery now works flawlessly, and every container
rejoins the cluster on boot.

Many thanks for your help, guys !

Le jeudi 13 mars 2014 17:07:58 UTC+1, Guillaume Loetscher a écrit :

Nice try :wink:

On node 1 :
root@es_node1:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
184: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether a2:f3:84:d1:43:64 brd ff:ff:ff:ff:ff:ff

On Node 2 :
root@es_node2:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode
DEFAULT
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
182: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
state UP mode DEFAULT qlen 1000
link/ether 26:ac:8b:14:58:ac brd ff:ff:ff:ff:ff:ff

On host :

84: br1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state
UP mode DEFAULT group default
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
183: veth9bZZVH: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:74:89:32:c8:30 brd ff:ff:ff:ff:ff:ff
185: vethIG4AFA: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:be:16:bf:53:4f brd ff:ff:ff:ff:ff:ff
187: veth_logstash: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
pfifo_fast master br1 state UP mode DEFAULT group default qlen 1000
link/ether fe:93:8a:74:89:5a brd ff:ff:ff:ff:ff:ff

Le jeudi 13 mars 2014 16:47:37 UTC+1, Jörg Prante a écrit :

Enter

ip addr show

or

ifconfig

and check if MULTICAST is configured on the interface.

Jörg

On Thu, Mar 13, 2014 at 4:29 PM, Guillaume Loetscher ster...@gmail.comwrote:

Definitely a multicast problem.

I've decided to switch to unicast, and I manage to shutdown any nodes
(elected master or not), and the remaining one is taking up the load
perfectly.

When the other node is getting back online, using the unicast discovery,
there's no problem, the elected master discovered another master node, and
add it in the cluster.

I don't know what was failing in my (virtual) network configuration, but
honestly, I cannot lost several more hours to point out where my mockup
failed for multicast.

Le jeudi 13 mars 2014 15:23:06 UTC+1, Guillaume Loetscher a écrit :

@Xiao Yu : nope, it's not working also.

@Clinton Gormley : Yes, just after the "no matching id" error, a telnet
from Node 1 to node 2 is possible, and I got a valid connection.

All, please reming that after such issue, if I manually stop the service
on Node 2, then restart it, it will manage to reach the cluster without a
problem.

I'm suspecting a "race condition" here, something like "Node 2 container
is booting so fast that the bridge is not ready to handle the multicast
packet, leading to a connection problem".

On Thursday, March 13, 2014 at 14:00:40 UTC+1, Xiao Yu wrote:

Total shot in the dark here, but try taking the hash mark out of the node
names and see if that helps?

On Thursday, March 13, 2014 5:31:30 AM UTC-4, Guillaume Loetscher wrote:

Sure

Node # 1:
root@es_node1:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #1"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

Node #2 :
root@es_node2:~# grep -E '^[^#]' /etc/elasticsearch/elasticsearch.yml
cluster.name: logstash
node.name: "Node ES #2"
node.master: true
node.data: true
index.number_of_shards: 2
index.number_of_replicas: 1
discovery.zen.ping.timeout: 10s

On Thursday, March 13, 2014 at 10:15:16 UTC+1, David Pilato wrote:

Did you set the same cluster name on both nodes?

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr

On March 13, 2014 at 09:57:35, Guillaume Loetscher (ster...@gmail.com)
wrote:

Hi,

First, thanks for the answers and remarks.

You are both right: the issue I'm facing leads to a "split-brain"
situation, where Node #1 and Node #2 are both master, each living its own
life on its side. I'll look at changing my configuration and the number
of nodes in order to limit this situation (I had already read this
article about avoiding split-brain in ES:
http://blog.trifork.com/2013/10/24/how-to-avoid-the-split-brain-problem-in-elasticsearch/
).

However, this split-brain situation is the result of the discovery /
multicast problem, which shows up in Node #2's log here:
[2014-03-12 22:03:52,709][WARN ][discovery.zen.ping.multicast] [Node ES #2]
received ping response ping_response{target [[Node ES
#1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{master=true}],
master [[Node ES #1][LbMQazWXR9uB6Q7R2xLxGQ][inet[/172.16.0.100:9300]]{
master=true}], cluster_name[logstash]} with no matching id [1]

So connectivity between Node #1 (the first one online, and therefore
master) and Node #2 is established, since Node #2's log clearly says
"received ping response", but the response comes back "with no matching
id".

This is apparently why Node #2 didn't join the cluster on Node #1, and it
is this specific issue I want to resolve.

Thanks,

On Thursday, March 13, 2014 at 07:03:35 UTC+1, David Pilato wrote:

Bonjour :slight_smile:

You should set min_master_nodes to 2. Although I'd recommend having 3
nodes instead of 2.
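For the record, the full setting name in 0.90.x Zen discovery is discovery.zen.minimum_master_nodes. A minimal sketch of the relevant elasticsearch.yml line, assuming the two-node cluster from this thread:

```yaml
# elasticsearch.yml — require a quorum of master-eligible nodes
# before electing a master: quorum = (master_eligible_nodes / 2) + 1.
# With 2 master-eligible nodes the quorum is 2, so an isolated node
# can no longer elect itself master (prevents split-brain), at the
# cost of the cluster blocking when either node is down — which is
# why 3 nodes (quorum still 2) is the recommended minimum.
discovery.zen.minimum_master_nodes: 2
```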

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs




(system) #14