Cannot set up a 3-node Elasticsearch cluster with Docker on 2 EC2 machines

I'm trying to set up 3 nodes on 2 machines with the following configuration,

following the doc: Install Elasticsearch with Docker | Elasticsearch Guide [7.15] | Elastic
(I run this compose file on each machine):

services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    container_name: es01
    environment:
      - node.name=es01
      - network.publish_host=10.2.0.38
      - cluster.name=my-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - network.host=0.0.0.0
      - network.bind_host=0.0.0.0
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - vit01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - elastic
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    container_name: es02
    environment:
      - node.name=es02
      - network.publish_host=10.2.0.38
      - cluster.name=my-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - network.host=0.0.0.0
      - network.bind_host=0.0.0.0
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - vit02:/usr/share/elasticsearch/data
    networks:
      - elastic
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    container_name: es03
    environment:
      - node.name=es03
      - network.publish_host=10.2.0.243
      - cluster.name=my-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - network.host=0.0.0.0
      - network.bind_host=0.0.0.0
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - vit03:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  vit01:
    external: true
  vit02:
    external: true
  vit03:
    external: true

networks:
  elastic:
    driver: bridge

The error I got on both machines

es01    | {"type": "server", "timestamp": "2021-10-16T12:35:38,508Z", "level": "WARN", "component": "o.e.d.HandshakingTransportAddressConnector", "cluster.name": "bigid-elasticsearch-cluster", "node.name": "es01", "message": "[connectToRemoteMasterNode[172.24.0.2:9300]] completed handshake with [{es02}{1LkU4MjdQwa2j-hw9IaVlA}{Y6V0srCoTsqehcNbdy8F_g}{10.2.0.38}{10.2.0.38:9300}{cdhilmrstw}{ml.machine_memory=66548318208, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] but followup connection failed", 
es01    | "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [es02][10.2.0.38:9300] connect_exception",
es01    | "at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:978) ~[elasticsearch-7.10.1.jar:7.10.1]",
es01    | "at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-7.10.1.jar:7.10.1]",
es01    | "at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.10.1.jar:7.10.1]",
es01    | "at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]",
es01    | "at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]",
es01    | "at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]",
es01    | "at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2152) ~[?:?]",
es01    | "at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.10.1.jar:7.10.1]",
es01    | "at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[?:?]",
es01    | "at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]",
es01    | "at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]",
es01    | "at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]",
es01    | "at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702) ~[?:?]",
es01    | "at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]",
es01    | "at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]",
es01    | "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]",
es01    | "at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]",
es01    | "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
es01    | "at java.lang.Thread.run(Thread.java:832) [?:?]",
es01    | "Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.2.0.38/10.2.0.38:9300",
  • TCP port 9300 is open on both machines; I can connect from one to the other with
    telnet 10.2.0.243 9300 and vice versa.

What am I missing?

The documentation you linked is for running all 3 nodes on the same host with docker compose, which is different from what you are trying to do.

Your architecture is a little confusing, can you explain how it works? Where are you running the docker-compose? What are the IP addresses of the instances?

Also, network.publish_host=10.2.0.38 is duplicated in this config: you can't have two different nodes publishing the same IP address and the same port. And your es02 container does not expose any ports at all.
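For example (a hypothetical sketch, not a tested config — the port numbers and the transport.publish_port value are my assumptions): if es01 and es02 are both meant to live on 10.2.0.38, es02 would need its own host port mappings and a distinct published transport port so that remote nodes know which port to dial back:

```yaml
# Sketch only: es02 sharing a host with es01 needs distinct host ports.
# transport.publish_port tells remote nodes which port to connect back on.
  es02:
    environment:
      - network.publish_host=10.2.0.38
      - transport.publish_port=9301
    ports:
      - 9201:9200   # HTTP, 9200 is already taken by es01
      - 9301:9300   # transport, 9300 is already taken by es01
```

The other nodes would then list 10.2.0.38:9301 (not just the hostname es02) in their discovery.seed_hosts.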

Thanks for the reply. Basically I have 2 EC2 machines, IPs 10.2.0.243 and 10.2.0.38, and I want to use docker compose to set up 3 nodes on them: 2 nodes on one machine and the 3rd on the other. Following the doc I created the compose file in my question and ran it on both machines. Maybe I need to split the file between the machines? (If I expose the same port on multiple nodes I get a port-already-in-use error.)

The doc assumes that everything runs on one machine; your architecture is completely different.

If you are running the same docker-compose file on both machines, you are starting 6 nodes: 3 on each EC2 instance.

I do not know much about docker, but I don't think that using the IP address of your host as the publish address of your container will work like that, as the containers run on a separate network created by docker.

From the docker documentation you have this:

Bridge networks apply to containers running on the same Docker daemon host. For communication among containers running on different Docker daemon hosts, you can either manage routing at the OS level, or you can use an overlay network.

Which basically means that your containers will only be able to connect to containers running on the same EC2 instance.

Your issue is more related to your network architecture: you just need to make sure that your containers can talk to each other on the publish address.
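One way to do that (a sketch, assuming running Docker Swarm is an option for you) is an attachable overlay network, created once on a swarm manager and then referenced as external from the compose file on each machine, instead of the per-host bridge network:

```yaml
# Sketch: reference a pre-created attachable overlay network so containers
# on different hosts share one Docker network. Created beforehand on a
# swarm manager with:
#   docker network create -d overlay --attachable elastic
networks:
  elastic:
    external: true
```

Alternatively, `network_mode: host` makes each container bind directly to the EC2 instance's own network interfaces, at the cost of losing container network isolation.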

Maybe the discussion in this post can help a little.

So apart from the bridge networking issue, which I'll change: do I need to split the compose file so that each machine has only its own nodes' configuration? E.g. machine 10.2.0.38 runs only the part that is relevant for es01 and es02, and so on.
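For what it's worth, a split along those lines might look like this (a rough, untested sketch assuming es01/es02 live on 10.2.0.38, es03 on 10.2.0.243, and that the containers can already reach each other across hosts; the seed ports are my assumptions):

```yaml
# Sketch for machine 10.2.0.38 only — es01 shown; es02 would be similar
# but mapped to different host ports (e.g. 9201/9301) and publishing
# transport.publish_port=9301.
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.1
    environment:
      - node.name=es01
      - cluster.name=my-cluster
      - network.publish_host=10.2.0.38
      # Seed hosts are the *host* IPs and ports, not container names:
      - discovery.seed_hosts=10.2.0.38:9301,10.2.0.243:9300
      - cluster.initial_master_nodes=es01,es02,es03
    ports:
      - 9200:9200
      - 9300:9300
```

The compose file on 10.2.0.243 would then define only es03, with discovery.seed_hosts pointing back at 10.2.0.38:9300 and 10.2.0.38:9301.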

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.