Elasticsearch 2.4 nodes do not form a cluster: ConnectTransportException

I am already running an ELK stack with Elasticsearch (ES) 1.7 in Docker on 3 nodes, each running one ES container, behind an nginx server. I am now trying to upgrade ES to 2.4.0. Running as the root user is not allowed in ES 2.4.0, so I am using the -Des.insecure.allow.root=true option.

The configuration file is modified as follows:

#Performance optimization settings
echo "index.number_of_replicas: 1" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "index.number_of_shards: 3" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "discovery.zen.ping.multicast.enabled: false" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "bootstrap.mlockall: true" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "indices.memory.index_buffer_size: 50%" >> ${ES_CONFIG_PATH}/elasticsearch.yml


#publish host as container host address
#echo "network.publish_host: ${CONTAINER_HOST_ADDRESS}" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "network.bind_host: ${CONTAINER_HOST_ADDRESS}" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "network.publish_host: ${CONTAINER_PRIVATE_IP}" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "network.bind_host: ${CONTAINER_PRIVATE_IP}" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "network.host: ${CONTAINER_HOST_ADDRESS}" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "network.host: 0.0.0.0" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "htpp.port: 9200" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#echo "transport.tcp.port: 9300-9400" >> ${ES_CONFIG_PATH}/elasticsearch.yml
#configure elasticsearch.yml for clustering
echo 'discovery.zen.ping.unicast.hosts: [ELASTICSEARCH_IPS] ' >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "discovery.zen.minimum_master_nodes: 1" >> ${ES_CONFIG_PATH}/elasticsearch.yml

ELASTICSEARCH_IPS is an array of the IPs of the other nodes, obtained by a script called query-crs-es.sh that runs on every node. Eventually the array holds the IPs of the other two nodes of the cluster. Please note that these are the nodes' IPs, not the containers' private IPs.
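
For illustration, the substitution described above could look roughly like the sketch below. The function name build_unicast_hosts and the example addresses are hypothetical; the thread does not show how query-crs-es.sh actually fills in ELASTICSEARCH_IPS.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a list of node IPs into the YAML line
# for discovery.zen.ping.unicast.hosts.
build_unicast_hosts() {
  local hosts="" ip
  for ip in "$@"; do
    # Quote each address and separate entries with ", ".
    hosts="${hosts:+${hosts}, }\"${ip}\""
  done
  printf 'discovery.zen.ping.unicast.hosts: [%s]\n' "$hosts"
}

# Example with placeholder addresses (not taken from the thread):
build_unicast_hosts 10.0.0.2 10.0.0.3
# prints: discovery.zen.ping.unicast.hosts: ["10.0.0.2", "10.0.0.3"]
```

The output could then be appended to ${ES_CONFIG_PATH}/elasticsearch.yml in the same way as the other echo lines.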

I use Ansible to run the containers. During startup, all nodes come up but fail to form a cluster. I consistently get the errors below.
Node 1 starts without any problem and is elected master; its name is Dragon Lord.

Node2:

[2016-10-07 09:45:58,561][WARN ][bootstrap                ] running as ROOT user. this is a bad idea!
[2016-10-07 09:45:58,729][INFO ][node                     ] [Defensor] version[2.4.0], pid[1], build[ce9f0c7/2016-08-29T09:14:17Z]
[2016-10-07 09:45:58,729][INFO ][node                     ] [Defensor] initializing ...
[2016-10-07 09:45:59,215][INFO ][plugins                  ] [Defensor] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2016-10-07 09:45:59,237][INFO ][env                      ] [Defensor] using [1] data paths, mounts [[/data (/dev/mapper/platform-data)]], net usable_space [2.5tb], net total_space [2.5tb], spins? [possibly], types [xfs]
[2016-10-07 09:45:59,237][INFO ][env                      ] [Defensor] heap size [989.8mb], compressed ordinary object pointers [true]
[2016-10-07 09:45:59,266][WARN ][threadpool               ] [Defensor] requested thread pool size [60] for [index] is too large; setting to maximum [32] instead
[2016-10-07 09:46:00,733][INFO ][node                     ] [Defensor] initialized
[2016-10-07 09:46:00,733][INFO ][node                     ] [Defensor] starting ...
[2016-10-07 09:46:00,833][INFO ][transport                ] [Defensor] publish_address {172.17.0.16:9300}, bound_addresses {[::]:9300}
[2016-10-07 09:46:00,837][INFO ][discovery                ] [Defensor] ccs-elasticsearch/RXALMe9NQVmbCz5gg1CwHA
[2016-10-07 09:46:03,876][WARN ][discovery.zen            ] [Defensor] failed to connect to master [{Dragon Lord}{5wNwWJRFRS-2dRY5AGqqGQ}{172.17.0.15}{172.17.0.15:9300}], retrying...
ConnectTransportException[[Dragon Lord][172.17.0.15:9300] connect_timeout[30s]]; nested: ConnectException[Connection refused: /172.17.0.15:9300];
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:1002)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:937)
Caused by: java.net.ConnectException: Connection refused: /172.17.0.15:9300
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)

Node 3 has similar logs.
As you can see from the logs, Nodes 2 and 3 are aware of the master, Node 1, but are unable to connect to it. I have tried most of the network.host configurations that you can see commented out in the configuration code, and none of them works.
Any leads will be appreciated.

Hi,

Can you reach Node 1 from Node 2 using the same hosts:ports you configured? It looks like the connection cannot be established between the Docker containers. I suppose some port redirection must be configured in your Docker file.

  1. From within Elasticsearch container on any node, curl to each nodeIp:9200 is reachable and gives correct response.
  2. From within Elasticsearch container on any node, curl to localhost:9200 is reachable and gives correct response.
  3. From any node, outside the container, curl to each nodeIp:9200 is reachable and gives correct response.
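
Note that the three checks above only exercise HTTP port 9200, while the ConnectTransportException in the logs is for transport port 9300. Below is a minimal sketch of a check for that port, assuming bash and the coreutils timeout command are available where it runs. The names port_open and NODE_IPS are hypothetical and not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a TCP connection to host:port opens
# within the timeout. Relies on bash's /dev/tcp redirection and on `timeout`.
port_open() {
  local host="$1" port="$2" secs="${3:-3}"
  timeout "$secs" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# NODE_IPS is a placeholder for a space-separated list of addresses to probe,
# e.g. the ones used in discovery.zen.ping.unicast.hosts.
for ip in ${NODE_IPS:-}; do
  if port_open "$ip" 9300; then
    echo "$ip:9300 reachable"
  else
    echo "$ip:9300 NOT reachable"
  fi
done
```

Probing the address the master publishes (172.17.0.15:9300 in the log above) from inside a container on another node would show whether that address is reachable at all.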

In the Dockerfile, I have exposed ports 9200 and 9300 using the EXPOSE command. Ansible also exposes both ports when running the containers.
Additionally, if I run curl -XGET $(hostname -i):9200/_cat/nodes (curl to the same node, using the IP from the hostname utility), only the master gives a correct response, i.e. it reports that there is only one node in the cluster, with that node's ES name. The other 2 nodes say

'{
  "error": {
    "root_cause": [
      {
        "type": "master_not_discovered_exception",
        "reason": null
      }
    ],
    "type": "master_not_discovered_exception",
    "reason": null
  },
  "status": 503
}'

The master also logged this line, in which the publish address and the bound address differ:
[2016-10-11 08:57:21,656][INFO ][transport ] [Betty Ross Banner] publish_address {172.17.0.15:9300}, bound_addresses {[::]:9300}
I have this configuration for both fields, as mentioned earlier:

echo "network.publish_host: 0.0.0.0" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "network.bind_host: 0.0.0.0" >> ${ES_CONFIG_PATH}/elasticsearch.yml

Might this be the problem? I have also tried the option network.host: 0.0.0.0.
Please ignore the ES node name. This is a fresh but failed installation, so the master's name has changed.

Replacing those settings with

echo "network.host: _site_" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "transpost_publish.host: _site_" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "transpost_bind.host: _site_" >> ${ES_CONFIG_PATH}/elasticsearch.yml
echo "transport_publish.port: 9300" >> ${ES_CONFIG_PATH}/elasticsearch.yml

got me this log line:
[2016-10-11 12:02:29,618][INFO ][transport ] [Ultragirl] publish_address {172.17.0.15:9300}, bound_addresses {172.17.0.15:9300}
It differs from the one mentioned above, but the problem persists.

I was able to form the cluster with the following settings:
network.publish_host=CONTAINER_HOST_ADDRESS (i.e. the address of the node where the container is running)
network.bind_host=0.0.0.0
transport.publish_port=9300
transport.publish_host=CONTAINER_HOST_ADDRESS
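
In the style of the configuration script from the first post, these settings could be written as in the sketch below. This is only an illustration: write_network_settings is a hypothetical helper name, and it assumes ES_CONFIG_PATH and CONTAINER_HOST_ADDRESS are set as in the original script and that port 9300 of the container is mapped to port 9300 on the host.

```shell
#!/usr/bin/env bash
# Sketch: append the settings reported to work to elasticsearch.yml.
# write_network_settings is a hypothetical helper name.
write_network_settings() {
  local config_file="$1" host_address="$2"
  {
    # Advertise the host's address, which the other nodes can reach,
    # instead of the container-private 172.17.x.x address.
    echo "network.publish_host: ${host_address}"
    # Listen on all interfaces inside the container.
    echo "network.bind_host: 0.0.0.0"
    echo "transport.publish_port: 9300"
    echo "transport.publish_host: ${host_address}"
  } >> "$config_file"
}

# Only write when both variables from the original script are set.
if [ -n "${ES_CONFIG_PATH:-}" ] && [ -n "${CONTAINER_HOST_ADDRESS:-}" ]; then
  write_network_settings "${ES_CONFIG_PATH}/elasticsearch.yml" "${CONTAINER_HOST_ADDRESS}"
fi
```

With these values the transport log line should show the host's address as publish_address, rather than the 172.17.0.x address seen in the earlier logs.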