I'm trying to run an elasticsearch cluster with each es-node running in its own container. These containers are deployed using ECS across several machines that may be running other unrelated containers. To avoid port conflicts each port a container exposes is assigned a random value. These random ports are consistent across all running containers of the same type. In other words, all running es-node containers map port 9300 to the same random number.
Here's the config I'm using:
network:
host: 0.0.0.0
plugin:
mandatory: cloud-aws
cluster:
name: ${ES_CLUSTER_NAME}
discovery:
type: ec2
ec2:
groups: ${ES_SECURITY_GROUP}
any_group: false
zen.ping.multicast.enabled: false
transport:
tcp.port: 9300
publish_port: ${_INSTANCE_PORT_TRANSPORT}
cloud.aws:
access_key: ${AWS_ACCESS_KEY}
secret_key: ${AWS_SECRET_KEY}
region: ${AWS_REGION}
In this case _INSTANCE_PORT_TRANSPORT
is the port that 9300 is bound to on the host machine. I've confirmed that all the environment variables used above are set correctly. I'm also setting network.publish_host
to the host machine's local IP via a command line arg.
When I forced _INSTANCE_PORT_TRANSPORT
(and in turn transport.publish_port
) to be 9300, everything worked great, but as soon as it's given a random value, nodes can no longer connect to each other. I see errors like this using logger.discovery=TRACE
:
ConnectTransportException[[][10.0.xxx.xxx:9300] connect_timeout[30s]]; nested: ConnectException[Connection refused: /10.0.xxx.xxx:9300];
at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:952)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:916)
at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:888)
at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:267)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:395)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It seems like the port a node binds to is the same as the port it pings while trying to connect to other nodes. Is there any way to make them different? If not, what's the point of transport.publish_port
?
Here are the full logs: https://gist.github.com/xavi-/6ecc4ba16b39680fb28c8fb25307bcc7