Unicast discovery

Hi,

I'm having problems getting unicast discovery working between two machines (one is a Linode, one is an EC2 instance). The EC2 ES instance seems to be able to talk to the Linode, but a timeout occurs somewhere that prevents it from joining the cluster.

First up: versions. This is ES 0.19.8, running on Ubuntu 12.04 LTS. OpenJDK reports:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~12.04.1)
OpenJDK Client VM (build 20.0-b12, mixed mode, sharing)

(I know those aren't ideal - that's a separate project!)

Here's a gist of the non-default configs and the logs:

https://gist.github.com/4547184

I've also set discovery logging to TRACE in logging.yml.
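
For reference, the discovery-related settings are roughly along these lines (a sketch only, not my exact config - the gist above has the real values, and the address here is just illustrative):

  # elasticsearch.yml on each node, pointing at the other node
  # (placeholder address - see the gist for the actual config)
  cluster.name: staging-es
  transport.tcp.port: 9700
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["176.58.126.151:9700"]

  # logging.yml - discovery bumped to TRACE
  logger:
    discovery: TRACE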

I start the Linode first, and once it's started, I start the EC2 instance.

Does anyone know what might cause that timeout in the EC2 instance, preventing it from joining the cluster?

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

--

To add some more information to this - I suspect it's to do with the fact that EC2 machines have an internal address and an external address. More detail...

I'm having problems getting unicast discovery working between two machines (one is a Linode, one is an EC2 instance). The EC2 ES instance seems to be able to talk to the Linode, but a timeout occurs somewhere that prevents it from joining the cluster.

I tried swapping the unicast host lists around, so that the EC2 instance is the one both nodes try to connect to. It still didn't work, but I see messages like this in the Linode node's logs:

[2013-01-16 13:57:22,189][TRACE][discovery.zen.ping.unicast] [Baroness Blood] [2] sending to [#zen_unicast_1#][inet[/54.247.0.254:9700]]
[2013-01-16 13:57:22,207][TRACE][discovery.zen.ping.unicast] [Baroness Blood] [2] received response from [#zen_unicast_1#][inet[/54.247.0.254:9700]]: [ping_response{target [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Baroness Blood][SqVsSWx3Qi27GfMgiQINfA][inet[/176.58.126.151:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Baroness Blood][SqVsSWx3Qi27GfMgiQINfA][inet[/176.58.126.151:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Baroness Blood][SqVsSWx3Qi27GfMgiQINfA][inet[/176.58.126.151:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Baroness Blood][SqVsSWx3Qi27GfMgiQINfA][inet[/176.58.126.151:9700]]], master [null], cluster_name[staging-es]}, ping_response{target [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], master [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], cluster_name[staging-es]}]
[2013-01-16 13:57:22,208][TRACE][discovery.zen.ping.unicast] [Baroness Blood] [2] disconnecting from [#zen_unicast_1#][inet[/54.247.0.254:9700]]
[2013-01-16 13:57:22,208][TRACE][discovery.zen ] [Baroness Blood] full ping responses:
--> target [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], master [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]]

To my uneducated eye, it appears that the EC2 ES instance is reporting its internal IP address rather than the externally accessible one. The Linode instance then goes on to log:

[2013-01-16 13:57:52,235][WARN ][discovery.zen ] [Baroness Blood] failed to connect to master [[Gideon, Gregory][OEaqkM5NR5yK5pgMS9nJ_g][inet[/10.33.160.162:9700]]], retrying...
org.elasticsearch.transport.ConnectTransportException: [Gideon, Gregory][inet[/10.33.160.162:9700]] connect_timeout[30s]
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:563)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:505)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:483)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
at org.elasticsearch.discovery.zen.ZenDiscovery.innterJoinCluster(ZenDiscovery.java:326)
at org.elasticsearch.discovery.zen.ZenDiscovery.access$500(ZenDiscovery.java:75)
at org.elasticsearch.discovery.zen.ZenDiscovery$1.run(ZenDiscovery.java:280)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.net.ConnectException: connection timed out
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processConnectTimeout(NioClientSocketPipelineSink.java:391)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:289)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more

Clearly, it's trying to connect directly to that internal IP address.

Is this configuration possible at all - joining a machine outside EC2 to a machine inside EC2? I guess I could tunnel the connection over SSH if I had to.
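
If it came to the SSH tunnel, I imagine it would be something like this (untested sketch, host and user are placeholders), forwarding the transport port from the Linode to the EC2 box and then pointing the unicast hosts list at localhost:

  # untested sketch - ubuntu@ec2-host is a placeholder
  ssh -N -L 9700:localhost:9700 ubuntu@ec2-host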

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

--

Your analysis is correct. After the initial unicast discovery, your Linode is trying to connect to EC2 using the internal address that the EC2 node "published". You can override this address using the network.publish_host setting (http://www.elasticsearch.org/guide/reference/modules/network.html). It might also be a good idea to increase the ping timeouts on your nodes, considering that they will be located in different data centers.
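
For example, something along these lines in elasticsearch.yml on the EC2 node (the address is just the external IP from your logs, and the timeout value is only a suggestion):

  # publish the externally reachable address instead of the internal one
  network.publish_host: 54.247.0.254
  # allow more time for pings across data centers (default is 3s)
  discovery.zen.ping.timeout: 10s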

--

Hi Igor,

On 16 Jan 2013, at 14:35, Igor Motov <imotov@gmail.com> wrote:

Your analysis is correct. After the initial unicast discovery, your Linode is trying to connect to EC2 using the internal address that the EC2 node "published". You can override this address using the network.publish_host setting. It might also be a good idea to increase the ping timeouts on your nodes, considering that they will be located in different data centers.

Aha - I hadn't discovered that setting. Thanks. I'd already increased the ping timeouts, anticipating that problem. I'll give that a go.

I also chatted to Clint Gormley on IRC yesterday, and he suggested an rsync approach for the data/ directory as a (simpler!) alternative.
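
(For the record, I assume the rsync route would be roughly this sort of thing - paths and host are placeholders, and I'd flush or stop the node before copying - rather than trying to cluster across the two data centres at all:

  # rough sketch only - paths and host are placeholders
  rsync -az --delete /var/lib/elasticsearch/staging-es/ ubuntu@target:/var/lib/elasticsearch/staging-es/
)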

Cheers,
Dan

Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

--