Node discovery broken?


(Lukáš Vlček) #1

Hi,

I am facing strange node discovery issues.
I have two nodes (different machines with different IP addresses: 192.168.2.2
and 192.168.2.3). If I unpack the recent master release zip and start
elasticsearch -f on both machines (no changes to elasticsearch.yml), then
depending on the order in which I start the processes I get different
results.

To put it simply:
No matter whether I start ES on node A or node B first, the nodes do not
discover each other. If I start node A first, I get an exception on it when
starting node B.

Attached are log files:
usecase #1: node A was started first (N'Gabthoth), then node B was started
(Destroyer - what a perfect name!) and node A got some exceptions... then I
shut down both nodes.
usecase #2: node B was started first (Blink), then node A was started
(Toxin). No exceptions in either node's log, but they still do not discover
each other... shut down both nodes.

Why am I getting exceptions in usecase #1?
Do I have to configure the nodes to make them discover each other?

Regards,
Lukas


(nfo) #2

Could be because you have multiple network interfaces (parallels?
vmware?).

Try adding the local IP you want to use in each config/
elasticsearch.yml:

network:
    host: 192.168.2.2

You should also be able to set your network interface name (like
en0), but there seems to be a bug with that, at least on Leopard:
http://github.com/elasticsearch/elasticsearch/issues/#issue/214 .
It would be nice to know whether setting this works for you (and to
update the comment on the issue on GitHub).


Attachments:
elasticsearch-1A-NGabthoth.log (33K)
elasticsearch-1B-Destroyer.log (14K)
elasticsearch-2A-Toxin.log (12K)
elasticsearch-2B-Blink.log (13K)


(Clinton Gormley) #3

On Wed, 2010-06-09 at 01:03 -0700, nfo wrote:

Could be because you have multiple network interfaces (parallels?
vmware?).

Try adding the local IP you want to use in each config/
elasticsearch.yml:

network:
    host: 192.168.2.2

This seems to be a frequent issue in recent releases. The problem seems
to arise as follows:

  1. The server has multiple IP addresses, eg:
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether d8:d3:85:a3:33:a4 brd ff:ff:ff:ff:ff:ff
    inet 192.168.10.50/24 brd 192.168.10.255 scope global eth0
    3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether d8:d3:85:a3:33:a6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.50.50/24 brd 192.168.50.255 scope global eth1

  2. The default config in ES is to bind to 0.0.0.0, ie all addresses, so
    in this case it binds to localhost, 192.168.10.50 and 192.168.50.50

  3. Then ES has to choose ONE address to use as the "publish" address, which
    is the address that other nodes use to connect to it. In this case it
    happens to choose 192.168.50.50

  4. Another node is started, gets the .50.50 address, tries to connect
    and because its default gateway is 192.168.10.1, it fails
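The failure mode in steps 2-4 above can be sketched in Python. This is a hypothetical model of the selection logic (`choose_publish_address` is an illustrative helper, not Elasticsearch's actual code): the node binds to every local address, then advertises just one of them, and the choice is arbitrary from the peer's point of view.

```python
import ipaddress

def choose_publish_address(bound_addresses):
    """Hypothetical sketch: pick the first non-loopback address from
    everything the node bound to (via 0.0.0.0) and advertise it.
    With several interfaces this choice is arbitrary, which is why
    discovery can fail."""
    for addr in bound_addresses:
        if not ipaddress.ip_address(addr).is_loopback:
            return addr
    return "127.0.0.1"

# Addresses from the example server above, bound via 0.0.0.0:
bound = ["127.0.0.1", "192.168.50.50", "192.168.10.50"]
print(choose_publish_address(bound))  # -> 192.168.50.50
```

If the second node can only route to 192.168.10.0/24, the advertised 192.168.50.50 is unreachable for it, and the connection attempt in step 4 fails.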

The solution is to specify either which IP address it should bind to, or
which IP address it should use as the publish address, eg:

network:
    host: 192.168.10.50

or

network:
    publish_host: 192.168.10.50

The latter method has the advantage that ES still binds to localhost as
well, while providing the correct address to other nodes.

An alternative is to say:

network:
    publish_host: en0

which publishes the address of the en0 interface (the first Ethernet
interface).

I'm wondering if there should also be the option:

network:
    publish_host: 192.168.10.0/24

which would publish only an IP address that falls into that subnet?
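Such a subnet-based selection is straightforward to express with Python's ipaddress module. This is a sketch of the proposed behaviour, not an existing Elasticsearch option, and `publish_address_for_subnet` is an illustrative name:

```python
import ipaddress

def publish_address_for_subnet(bound_addresses, subnet):
    """Among all bound addresses, publish the one that falls inside
    the configured subnet (sketch of the proposal above)."""
    net = ipaddress.ip_network(subnet)
    for addr in bound_addresses:
        if ipaddress.ip_address(addr) in net:
            return addr
    return None  # no bound address matches the subnet

bound = ["127.0.0.1", "192.168.50.50", "192.168.10.50"]
print(publish_address_for_subnet(bound, "192.168.10.0/24"))  # -> 192.168.10.50
```

With this rule, the node from the earlier example would always advertise its 192.168.10.0/24 address regardless of how many other interfaces it has.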

clint

