Connection to remote Elasticsearch cluster fails

Hello all,

I'm still learning Elasticsearch, so maybe I'm missing a key concept or parameter here.

Background:
So far I've installed two independent ES 6.6.0 clusters in two different regions. Let's call them cluster1 and cluster2.

Each cluster has :

Apart from some performance tuning still to be done, everything seems to work as expected so far. :+1:

Adding a remote cluster :
Now I want to be able to query cluster2 from kibana1.mysite.org, and get rid of kibana2.mysite.org.

To do so, from kibana1.mysite.org I go to :

Management -> Remote clusters -> Add a remote cluster

Then, I try to add cluster2 as a remote server (where X.X.X.X is cluster2 public IP) :

Name: cluster2 
Seed nodes: X.X.X.X:9300

It seems that the settings are updated, as I can see several lines like this in the logs :

[2019-03-04T16:37:40,627][INFO ][o.e.c.s.ClusterSettings  ] [jXovQsj] updating [cluster.remote.kek.seeds] from [[]] to [["X.X.X.X:9300"]]
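
A log line like this is what a cluster settings update produces. As a minimal sketch, the equivalent request body for `PUT _cluster/settings` could be built as below. It assumes the seeds are stored as a persistent setting (the thread does not say whether Kibana writes them as persistent or transient), and the helper name `remote_cluster_settings` is made up for illustration.

```python
import json
from typing import List


def remote_cluster_settings(alias: str, seeds: List[str]) -> dict:
    """Build a cluster settings body that registers a remote cluster.

    The resulting keys flatten to cluster.remote.<alias>.seeds, which is
    the setting name shown in the log line above.
    """
    # Assumption: persistent scope; swap in "transient" if that is preferred.
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": seeds}}}}}


# "X.X.X.X:9300" is the placeholder used in the thread, not a real address.
body = remote_cluster_settings("cluster2", ["X.X.X.X:9300"])
print(json.dumps(body, indent=2))
```

The printed JSON would be sent to the HTTP port of a cluster1 node (9200 by default), not to the transport port 9300.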

The problem :
Unfortunately the remote connection status of cluster2 is stuck at "Not connected".

If I try to reach cluster2 from any cluster1 node :

[root@elasticsearch-master-68964797cb-xxfs2 elasticsearch]# curl -X GET "X.X.X.X:9300/_remote/info"
This is not an HTTP port
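
The reply "This is not an HTTP port" comes from the transport port itself, so it shows that a TCP connection to that one address and port can be opened, and nothing more. The same check can be done without curl; this is a minimal sketch in which the function name `can_open_tcp` is an illustrative choice.

```python
import socket


def can_open_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, "no route to host" and DNS errors.
        return False


# Example call (placeholder address from the thread, so it is left commented out):
# can_open_tcp("X.X.X.X", 9300)
```

A `True` result only covers the address that was probed. It says nothing about any other address the remote nodes may advertise for themselves.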

So I guess it's not a firewall problem.
In every node's configuration I've set the following parameters:

    - name: cluster.remote.connect
      value: "true"
    - name: http.cors.enabled
      value: "true"
    - name: http.cors.allow-origin
      value: "*"
    - name: network.host
      value: " 0.0.0.0"

What am I missing here? Maybe those parameters are not set correctly? Or am I even connecting to the right port number?

Please let me know what additional information you need, I'll be glad to provide it.

Are there any other lines in the logs indicating exceptions raised while cluster1 tries to talk to cluster2?

I'm not 100% sure, as I'm browsing the logs through Kibana, but I can't find any relevant lines after the "updating" ones. I'd expect at least timeouts or transport errors...

I have an NGINX controller on cluster2 and I can see incoming TCP connections a few seconds after adding the remote cluster :

[05/Mar/2019:09:09:19 +0000]TCP20015781786.246
[05/Mar/2019:09:09:22 +0000]TCP20015781789.304
[05/Mar/2019:09:09:22 +0000]TCP20015811789.317
[05/Mar/2019:09:09:22 +0000]TCP20015811789.285
[05/Mar/2019:09:09:22 +0000]TCP20015781789.141
[05/Mar/2019:09:09:22 +0000]TCP20015781789.283

Then it's complete silence. Does the remote cluster2 need to open connections towards cluster1 on port 9300?

Edit: I had a doubt, so on cluster1 I've opened port 9300 for incoming connections; connections are now open both ways. From cluster2:

[root@elasticsearch-master-5b885d49f8-sknfr elasticsearch]# curl -X GET "cluster1ip:9300/_remote/info"
This is not an HTTP port

No change in behaviour when adding remote cluster.

To simplify things, I have built two very simple clusters (one node each, holding both the master and data roles).

They still run in two separate Kubernetes clusters.

I found these logs on cluster1 when trying to add cluster2:

[2019-03-18T11:06:51,781][INFO ][o.e.c.s.ClusterSettings  ] [RXBUmGo] updating [cluster.remote.dev02.seeds] from [[]] to [["XXPUBLICIPXX:9300"]]
[2019-03-18T11:06:54,892][WARN ][o.e.t.RemoteClusterConnection] [RXBUmGo] fetching nodes from external cluster [dev02] failed
org.elasticsearch.transport.ConnectTransportException: [0uRpOEp][10.244.0.11:9300] connect_exception
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1569) ~[elasticsearch-6.6.0.jar:6.6.0]
	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-6.6.0.jar:6.6.0]
	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-6.6.0.jar:6.6.0]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.............
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: 10.244.0.11/10.244.0.11:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	... 6 more
Caused by: java.net.NoRouteToHostException: No route to host
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	... 6 more
[2019-03-18T11:06:54,902][WARN ][o.e.t.RemoteClusterService] [RXBUmGo] failed to update seed list for cluster: dev02
org.elasticsearch.transport.ConnectTransportException: [0uRpOEp][10.244.0.11:9300] connect_exception
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1569) ~[elasticsearch-6.6.0.jar:6.6.0]
	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-6.6.0.jar:6.6.0]
	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-6.6.0.jar:6.6.0]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-6.6.0.jar:6.6.0]
	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$1(Netty4TcpChannel.java:72) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) ~[?:?]
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: 10.244.0.11/10.244.0.11:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	... 6 more
Caused by: java.net.NoRouteToHostException: No route to host
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
	... 6 more

Seems to be a "no route to host" error. Any thoughts ?

This indicates some kind of connectivity issue. The node whose logs you've quoted can't access 10.244.0.11.
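
To see which address a node failed to reach without reading the whole stack trace, the first line of the exception is enough. The sketch below pulls the node ID, host and port out of a `ConnectTransportException` line. The regular expression is written against the lines quoted in this thread and is an assumption about their shape, not a documented log format; it does not handle IPv6 addresses.

```python
import re
from typing import Optional, Tuple

# Matches e.g. "ConnectTransportException: [0uRpOEp][10.244.0.11:9300] connect_exception"
CONNECT_FAILURE = re.compile(
    r"ConnectTransportException: \[(?P<node>[^\]]+)\]\[(?P<host>[^\]:]+):(?P<port>\d+)\]"
)


def failed_target(line: str) -> Optional[Tuple[str, str, int]]:
    """Return (node_id, host, port) for a connect failure line, else None."""
    match = CONNECT_FAILURE.search(line)
    if match is None:
        return None
    return match.group("node"), match.group("host"), int(match.group("port"))


line = (
    "org.elasticsearch.transport.ConnectTransportException: "
    "[0uRpOEp][10.244.0.11:9300] connect_exception"
)
print(failed_target(line))
```

The address it reports here is 10.244.0.11, which is different from the public seed address that was configured.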

Thank you David! I think I've found what is happening. From cluster1:

kubectl exec -it -n elastic elasticsearch-5c496d7c5c-2hh5k -- /bin/bash
[root@elasticsearch-5c496d7c5c-2hh5k elasticsearch]# curl -X GET "cluster2ip:9300/_remote/info"
This is not an HTTP port

With curl there's no problem. But when adding cluster2 as a remote cluster:

[2019-03-19T14:16:15,251][WARN ][o.e.t.RemoteClusterService] [RXBUmGo] failed to update seed list for cluster: dev2
org.elasticsearch.transport.ConnectTransportException: [0uRpOEp][10.244.0.11:9300] connect_exception

10.244.0.11 is the local (pod) IP of the cluster2 node:

NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                       NOMINATED NODE
elasticsearch-5c496d7c5c-kp8fx   1/1     Running   0          28h   10.244.0.11   aks-nodepool1-16884760-0   <none>

It looks as if, during the connection process, cluster1 asks cluster2 for an IP address and cluster2 answers with its local IP.

Exploring the elasticsearch.yaml, I've found that network.host is set to 0.0.0.0, and I don't know why...
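
A quick way to confirm that an address such as 10.244.0.11 cannot be reached from another region is to check whether it falls in a private range. This is a minimal sketch using Python's standard `ipaddress` module; the function name `is_private_address` and the public example address 8.8.8.8 are illustrative choices, not values from the thread.

```python
import ipaddress


def is_private_address(host: str) -> bool:
    """Return True if host is an IP literal in a private (non-routable) range."""
    return ipaddress.ip_address(host).is_private


# Both pod IPs from the thread are in 10.0.0.0/8, so they are private.
for host in ("10.244.0.11", "10.244.0.24", "8.8.8.8"):
    print(host, is_private_address(host))
```

A private address is only meaningful inside its own network, so a node in cluster1 has no route to it unless the two networks are joined in some way.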

I'm a bit confused reading the networking guide : https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html

My use case is :

cluster1 (10.244.0.24)
  |
  v
Nginx-controller (cluster1 Public Ip)
  ^
  | INTERNET
  v
Nginx-controller (cluster2 Public Ip)
   ^
   |
cluster2  (10.244.0.11)

So on cluster2 I have to set :

  • network.publish_host to xx.xx.xx.xx (cluster2 public IP), or maybe _global_ ?
  • and network.bind_host to 10.244.0.11 (its local IP), or maybe _site_ ?

Is that correct, or have I misunderstood something else?

Yes, that's exactly what happens. Technically it's the network.publish_host that cluster2 is answering with. It might work to set this to cluster2's public IP address, but this will cause a good deal of confusion if there are multiple nodes in this cluster, because network.publish_host is how the nodes contact each other within a cluster too.
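
The behaviour described here can be pictured with a toy model. This is not Elasticsearch code, only a sketch of the sequence: the caller reaches a seed address, learns the address each node publishes, then tries to connect to those published addresses. All names (`Node`, `sniff`, `PUBLIC_IP`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Node:
    name: str
    publish_address: str  # what the node advertises, driven by network.publish_host


def sniff(seed: str, reachable: Set[str], nodes_behind: Dict[str, List[Node]]) -> Dict[str, bool]:
    """Toy model: contact the seed, then try every published address."""
    if seed not in reachable:
        raise ConnectionError(f"seed {seed} is unreachable")
    # The caller connects to what the nodes publish, not to the seed address.
    return {n.publish_address: n.publish_address in reachable for n in nodes_behind[seed]}


# From cluster1, only the public entry point of cluster2 is reachable.
reachable_from_cluster1 = {"PUBLIC_IP:9300"}
cluster2 = {"PUBLIC_IP:9300": [Node("0uRpOEp", "10.244.0.11:9300")]}

result = sniff("PUBLIC_IP:9300", reachable_from_cluster1, cluster2)
print(result)
```

In this model the seed answers, but the follow-up connection targets 10.244.0.11:9300, which is the address that appears in the "No route to host" errors.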

I think the preferred approach is to configure your network so that the two clusters have a consistent view of their respective IP addresses, for example with a VPN.

It's possible that in cluster1 you could set <<REDACTED>> to the public IP address of cluster2. I found this (deliberately) undocumented setting in the source code, but it might do what you are looking for.

Thank you David! cluster.remote.$REMOTE_CLUSTER_NAME.proxy seems to work. Any reason why this parameter isn't documented? It's a lifesaver. :slightly_smiling_face:

In cluster1's elasticsearch.yaml I've set:

cluster.remote.eusdeves02.proxy: "publicip:9300"
cluster.remote.eusdeves02.seeds: "elasticsearch.elastic:9300"

elasticsearch.elastic:9300 is the Kubernetes service name of my Elastic node(s) on cluster2 side.

And voilà !

I have modified cluster1 and cluster2 so that they now have 3 nodes each, and redeployed the whole infrastructure. Now I have to find out why it still shows only 1 connected node instead of 3 (I guess).
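
The connected node count can be read from `GET _remote/info`. That is an HTTP API, so it has to be called on the HTTP port of a cluster1 node (9200 by default) rather than on 9300, which is why the earlier curl calls answered "This is not an HTTP port". The sketch below summarises such a response. The sample payload is hypothetical and only mirrors the `connected` and `num_nodes_connected` fields as I understand that API; the function name is made up.

```python
from typing import Dict


def summarize_remote_info(payload: Dict[str, dict]) -> Dict[str, str]:
    """Turn a _remote/info style payload into one status string per cluster alias."""
    summary = {}
    for alias, info in payload.items():
        state = "connected" if info.get("connected") else "not connected"
        summary[alias] = f"{state}, {info.get('num_nodes_connected', 0)} node(s)"
    return summary


# Hypothetical sample, shaped after the settings quoted in this thread.
sample = {
    "eusdeves02": {
        "seeds": ["elasticsearch.elastic:9300"],
        "connected": True,
        "num_nodes_connected": 1,
    }
}
print(summarize_remote_info(sample))
```

Running this against the real response would show, per remote alias, whether the count stays at 1 after the redeployment.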

When it was introduced, it was incompatible with TLS. However, I think that is no longer the case, so I'll see if I can add it to the docs. The relevant PR is #33062 if you're interested.

I'm not sure about this, sorry.

Some of the advice I gave in this post is dangerous since it relies on an internal setting that doesn't behave as one might expect. I've therefore hidden this thread.