Transport Client can't connect to AWS EC2 Cluster

I have an ES 2.3.2 cluster configured and running in AWS EC2 (VPC). I've opened up both the REST and Transport ports in the security group. I want to connect a TransportClient to the remote cluster running in AWS EC2, but it never seems to connect.

tl;dr: For the nodes to form a cluster, publish_host has to be the EC2 internal IP so the hosts can reach each other within the VPC. Outside the VPC the internal IP address is unreachable, but the TransportClient only seems to connect if the address passed to addTransportAddress matches the publish_host.

The ports are correct and open, and the host is reachable from outside the VPC.

I'm able to curl against both ports. The REST port returns the expected response:

{
  "name" : "NODE_NAME",
  "cluster_name" : "CLUSTER_NAME",
  "version" : {
    "number" : "2.3.2",
    "build_hash" : "b9e4a6acad4008027e4038f6abed7f7dba346f94",
    "build_timestamp" : "2016-04-21T16:03:47Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}


When I curl against the Transport port it is connecting to the cluster but obviously does not serve HTTP traffic and returns the following message:

 This is not a HTTP port


Yet whenever I attempt to initialize a TransportClient using the following configuration it has no available nodes:

Settings.settingsBuilder()
        .put(ELASTICSEARCH_CLIENT_TRANSPORT_SNIFF_KEY, false)
        .put(ELASTICSEARCH_CLIENT_TRANSPORT_IGNORE_CLUSTER_NAME_KEY, true)
        .put(ELASTICSEARCH_CLIENT_TRANSPORT_PING_TIMEOUT_KEY, "30s")
        .put(ELASTICSEARCH_CLIENT_TRANSPORT_NODES_SAMPLER_INTERVAL_KEY, "30s")
        .build()

...

transportClient.addTransportAddress(EC2_INSTANCE_PUBLIC_IP)

I am using the EC2 discovery mechanism.
Pertinent config section(s):

transport.tcp.port: 8193
transport.tcp.compress: true
http.compression: true 
http.cors.enabled: true
http.port: 8192

discovery.type: ec2
discovery.ec2.tag.ElasticSearch: DeviceProfileLookup

network.host: ["_site_"]
network.bind_host: 0.0.0.0

I've tried setting network.host to ["_ec2:publicIp_", "_ec2:privateIp_"], but that prevents the nodes from forming a cluster on startup.

It sounds like the TransportClient is only able to connect to the cluster if the address used is the same as the publish_host. When I tried setting the publish_host to _ec2:publicIp_, the TransportClient was able to connect, but then the hosts that live in EC2 were unable to connect to each other.
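To make the internal/external distinction concrete, here is a minimal sketch in plain Java (no Elasticsearch dependency) that labels an IP literal as loopback, site-local, or other. AddressKind and classify are hypothetical names, and the sample addresses are placeholders, not the real instance IPs. It relies on the JDK's InetAddress.isSiteLocalAddress(); I am assuming that roughly matches what _site_ selects, which I have not verified against the Elasticsearch source.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class AddressKind {
    /** Labels an IP literal as "loopback", "site-local" (private range), or "other". */
    public static String classify(String ip) {
        InetAddress addr;
        try {
            // For an IP literal this only parses the text; no DNS lookup happens.
            addr = InetAddress.getByName(ip);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid address: " + ip, e);
        }
        if (addr.isLoopbackAddress()) {
            return "loopback";
        }
        if (addr.isSiteLocalAddress()) {
            // 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16 for IPv4
            return "site-local";
        }
        return "other";
    }

    public static void main(String[] args) {
        // Placeholder addresses: a VPC-style private IP and a documentation-range IP.
        String[] samples = args.length > 0 ? args : new String[] {"10.0.1.5", "203.0.113.10"};
        for (String ip : samples) {
            System.out.println(ip + " -> " + classify(ip));
        }
    }
}
```

An address reported as site-local is only routable inside the VPC, which is why publishing it works for node-to-node traffic but not for a client outside the VPC.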

Any insight or advice would be much appreciated.

Thanks.

Which error do you get with this?

If you set network.host to the public IP, you need to set discovery.ec2.host_type to public_ip.

When I use the following config:

plugin.mandatory: cloud-aws
discovery.type: ec2
discovery.ec2.host_type: public_ip
network.publish_host: ["_ec2:publicIp_"]
network.bind_host: 0.0.0.0

Looking at the logs, the data node doesn't even find the master node to try to connect to (at least I don't see any timeouts), and when I query for health it reports that it has no known master.

Log Snippet
    [2016-07-14 20:53:51,744][WARN ][rest.suppressed          ] /_cluster/health Params: {}
    MasterNotDiscoveredException[null]
            at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:226)
            at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:236)
            at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:804)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    [2016-07-14 20:53:56,773][DEBUG][action.admin.cluster.health] [DATA_NODE_NAME] no known master node, scheduling a retry
    [2016-07-14 20:53:56,773][DEBUG][action.admin.cluster.health] [DATA_NODE_NAME] timed out while retrying [cluster:monitor/health] after failure (timeout [30s])

When I use the following config:

plugin.mandatory: cloud-aws
discovery.type: ec2
discovery.ec2.host_type: private_ip
network.publish_host: ["_site_", "_ec2:publicIp_", "_ec2:privateIp_"]
network.bind_host: 0.0.0.0

The data node discovers the master but requests to it time out.

Log Snippet
    [2016-07-14 20:55:07,233][WARN ][discovery.ec2            ] [DATA_NODE_NAME] failed to connect to master [{MASTER_NODE_NAME}{oPGzf5lbS2GaJDPZBJl3lw}{MASTER_PUBLIC_IP}{MASTER_PUBLIC_IP:PORT_NUMBER}{availability_zone=us-east-1b, data=false, master=true}], retrying...
    ConnectTransportException[[MASTER_NODE_NAME][MASTER_PUBLIC_IP:PORT_NUMBER] connect_timeout[30s]]; nested: ConnectTimeoutException[connection timed out: /MASTER_PUBLIC_IP:PORT_NUMBER];
            at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:987)
            at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:920)
            at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:893)
            at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:260)
            at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:434)
            at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:386)
            at org.elasticsearch.discovery.zen.ZenDiscovery.access$4800(ZenDiscovery.java:91)
            at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1237)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: org.jboss.netty.channel.ConnectTimeoutException: connection timed out: /MASTER_PUBLIC_IP:PORT_NUMBER
            at org.jboss.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:139)
            at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
            at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
            at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
            at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
            at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
            ... 3 more
    [2016-07-14 20:55:09,554][DEBUG][action.admin.cluster.health] [DATA_NODE_NAME] no known master node, scheduling a retry
    [2016-07-14 20:55:10,862][DEBUG][action.admin.cluster.health] [DATA_NODE_NAME] no known master node, scheduling a retry

I don't know if you solved it (and sorry for the delay).

I wonder if you could try defining discovery.ec2.groups as well.
Also, define the cloud.aws.region.

Let me know if it fixes your issue.

Also, could you check that you can actually telnet from one machine to the other using the public IP address on the transport port (9300 by default, 8193 in your config)? If not, you need to check all firewall settings and security groups.
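If telnet is not installed on the instances, the same reachability check can be sketched in plain Java. This is only a minimal sketch and not part of Elasticsearch: PortCheck and canConnect are hypothetical names, and the fallback port 9300 is simply the Elasticsearch default transport port, so pass whatever transport.tcp.port is set to on your nodes.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    /** Returns true if a TCP connection to host:port is established within timeoutMillis. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connection refused, connect timeout and unresolvable host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Usage: java PortCheck <host> <port>
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9300;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 5000));
    }
}
```

Run it from one node against the other node's public IP and then against its private IP. If only the private IP is reachable, the security group or firewall is likely blocking the public path, which would match the connect_timeout in the log above.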