Handshake failed, unexpected remote node (IP/port changes)

I am trying to change my elasticsearch.yml config so that Elasticsearch uses a TCP tunnel between my servers instead of a direct connection. I managed the same setup without problems for MariaDB.

For this I am attempting to change just the following:

server1:

network.host: 192.168.10.160
transport.tcp.port: 9984
discovery.zen.ping.unicast.hosts: ["192.168.10.160:9984", "192.168.10.161:9984"]

server2:

network.host: 192.168.10.161
transport.tcp.port: 9984
discovery.zen.ping.unicast.hosts: ["192.168.10.160:9984", "192.168.10.161:9984"]

to the following on both servers:

network.host: 127.0.0.1
transport.tcp.port: 9984
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9984", "127.0.0.1:9982"]

127.0.0.1:9982 is the TCP tunnel, leading to 127.0.0.1:9984 on the other server. So

127.0.0.1:9984 = always the current server
127.0.0.1:9982 = other server via TCP tunnel
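
To make the mapping explicit: the tunnel itself is not Elasticsearch-specific, and an SSH local port forward would be an equivalent way to set it up (purely as an illustration, not my actual tunnel tool; "user" is just a placeholder):

# on server1: listen on 127.0.0.1:9982 and forward to 127.0.0.1:9984 on server2
ssh -N -L 127.0.0.1:9982:127.0.0.1:9984 user@192.168.10.161
# on server2: listen on 127.0.0.1:9982 and forward to 127.0.0.1:9984 on server1
ssh -N -L 127.0.0.1:9982:127.0.0.1:9984 user@192.168.10.160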

One of the servers (server2) starts up fine:

[2018-01-29T01:50:52,214][INFO ][o.e.t.TransportService   ] [server2] publish_address {127.0.0.1:9784}, bound_addresses {127.0.0.1:9784}
[2018-01-29T01:50:52,333][WARN ][o.e.t.n.Netty4Transport  ] [server2] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@28f15758]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2018-01-29T01:50:55,317][INFO ][o.e.c.s.MasterService    ] [server2] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {server2}{Cl_vBCJrQumobozWnGkyHA}{lPqBzF_USqS4tUfVdyVXOg}{127.0.0.1}{127.0.0.1:9784}
[2018-01-29T01:50:55,321][INFO ][o.e.c.s.ClusterApplierService] [server2] new_master {server2}{Cl_vBCJrQumobozWnGkyHA}{lPqBzF_USqS4tUfVdyVXOg}{127.0.0.1}{127.0.0.1:9784}, reason: apply cluster state (from master [master {server2}{Cl_vBCJrQumobozWnGkyHA}{lPqBzF_USqS4tUfVdyVXOg}{127.0.0.1}{127.0.0.1:9784} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2018-01-29T01:50:55,346][INFO ][o.e.h.n.Netty4HttpServerTransport] [server2] publish_address {127.0.0.1:9683}, bound_addresses {127.0.0.1:9683}
[2018-01-29T01:50:55,347][INFO ][o.e.n.Node               ] [server2] started
[2018-01-29T01:50:56,009][INFO ][o.e.g.GatewayService     ] [server2] recovered [7] indices into cluster_state
[2018-01-29T01:50:56,341][INFO ][o.e.c.r.a.AllocationService] [server2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[aaproducts_1510450266][0]] ...]).

while the other server (server1) floods my Elasticsearch logs with:

[2018-01-29T01:51:49,218][WARN ][o.e.d.z.ZenDiscovery     ] [server1] failed to connect to master [{server2}{Cl_vBCJrQumobozWnGkyHA}{lPqBzF_USqS4tUfVdyVXOg}{127.0.0.1}{127.0.0.1:9784}], retrying...
org.elasticsearch.transport.ConnectTransportException: [server2][127.0.0.1:9784] handshake failed. unexpected remote node {server1}{ldNkwVs5RFSebpa3au2FdA}{sV9XkuK2SQ-9ZBc3UmRORA}{127.0.0.1}{127.0.0.1:9784}
	at org.elasticsearch.transport.TransportService.lambda$connectToNode$3(TransportService.java:331) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:514) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:327) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:314) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:515) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:483) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1253) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.2.jar:6.1.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_111]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_111]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

So server2 starts up, detects itself, and elects itself master (discovery.zen.minimum_master_nodes is set to 1 on both nodes). server1 starts up and wants to connect to the master, but the master's published address is server1's own IP and port combination (which doesn't make sense to me anyway), so server1 ends up reaching itself. It then notices that the node at that address is not the expected master, is confused that server1 != server2, floods the logs, and never elects itself master, so Elasticsearch never properly starts up on that node.

I don't quite understand why this happens. With the tunnel in between it may be harder for Elasticsearch to detect whether the connection can be made or not, which is presumably why this warning is logged on the server that works:

[2018-01-29T02:32:45,183][WARN ][o.e.t.n.Netty4Transport  ] [server2] send message failed [channel: org.elasticsearch.transport.netty4.NettyTcpChannel@6ed323b7]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
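
In case it helps with diagnosing: cluster membership can also be checked over HTTP against the working node (9683 is its HTTP port from the log above), for example:

curl -s 'http://127.0.0.1:9683/_cat/nodes?v'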

Yet one of the two servers always goes haywire: it does not recognize itself and expects the other server at that address, even though the TCP proxy between them seems to work fine.
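
If the root cause is that both nodes publish the exact same loopback address and port, would the way to go be to give each node its own local port, so that a given 127.0.0.1:<port> refers to the same node from both machines? A sketch of what I mean (the 9985 port and the re-pointed tunnels are hypothetical, and I have not tested this):

server1:

network.host: 127.0.0.1
transport.tcp.port: 9984
# 127.0.0.1:9985 would be a tunnel leading to 9985 on server2
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9984", "127.0.0.1:9985"]

server2:

network.host: 127.0.0.1
transport.tcp.port: 9985
# 127.0.0.1:9984 would be a tunnel leading to 9984 on server1
discovery.zen.ping.unicast.hosts: ["127.0.0.1:9984", "127.0.0.1:9985"]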
