Master not discovered yet

I'm trying to build an Elasticsearch cluster on Docker Swarm and I have the following compose file:

  elastic:
    # image: elasticsearch:7.14.1
    image: elasticsearch:7.6.2
    hostname: elastic.{{.Task.Slot}}
    environment:
      - node.name=elastic.{{.Task.Slot}}
      - cluster.name=docker-cluster
      - cluster.initial_master_nodes=elastic.1,elastic.2,elastic.3
      # - discovery.seed_hosts=tasks.elastic
      - discovery.seed_hosts=elastic.1,elastic.2
      - bootstrap.memory_lock=true
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ulimits:
      memlock:
        soft: -1
        hard: -1
    restart: unless-stopped
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '1'
          memory: 1G

But I get the following warnings when I run it:

join_elastic.1.71fhh7szr7sa@desktop    | {"type": "server", "timestamp": "2022-06-21T10:22:07,802Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.1}{3mQAxoBxQvOS75RSLr-8uw}{qPZSuIRhRuaGOWgsaw3hXw}{10.0.0.206}{10.0.0.206:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [10.0.19.20:9300, 10.0.19.21:9300] from hosts providers and [{elastic.1}{3mQAxoBxQvOS75RSLr-8uw}{qPZSuIRhRuaGOWgsaw3hXw}{10.0.0.206}{10.0.0.206:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
join_elastic.3.394awcygg5j4@desktop    | {"type": "server", "timestamp": "2022-06-21T10:22:16,946Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.3", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.3}{za3wtoYRRD6Mp5GpjAhRLQ}{t5Mgqu3aRIKObZMFGDAsJw}{10.0.0.208}{10.0.0.208:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [10.0.19.20:9300, 10.0.19.21:9300] from hosts providers and [{elastic.3}{za3wtoYRRD6Mp5GpjAhRLQ}{t5Mgqu3aRIKObZMFGDAsJw}{10.0.0.208}{10.0.0.208:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
join_elastic.2.h2wexvxpvlwd@desktop    | {"type": "server", "timestamp": "2022-06-21T10:22:17,609Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.2", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.2}{nFQB-tblQrmmsvbOzjZ7bw}{UYKi63N6QWmNL3k0l_uQFQ}{10.0.0.207}{10.0.0.207:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [10.0.19.20:9300, 10.0.19.21:9300] from hosts providers and [{elastic.2}{nFQB-tblQrmmsvbOzjZ7bw}{UYKi63N6QWmNL3k0l_uQFQ}{10.0.0.207}{10.0.0.207:9300}{dilm}{ml.machine_memory=67043934208, xpack.installed=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

For some reason the nodes can't discover each other... Anyone know why?

7.6 is long past EOL and no longer supported. Newer versions have more detailed logging to help explain this sort of problem. See these docs for more information.

I moved to 7.16.2 and get the same error:

join_elastic.3.4pbsscsf6v71@desktop    | {"type": "server", "timestamp": "2022-06-21T11:13:08,499Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.3", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.3}{rNs3MLcvRTSd4_bJJ2IPTQ}{n25tr12xQR-UUTM7eHL7IA}{10.0.0.219}{10.0.0.219:9300}{cdfhilmrstw}]; discovery will continue using [10.0.21.20:9300, 10.0.21.21:9300] from hosts providers and [{elastic.3}{rNs3MLcvRTSd4_bJJ2IPTQ}{n25tr12xQR-UUTM7eHL7IA}{10.0.0.219}{10.0.0.219:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
join_elastic.2.42nzjg5wivos@desktop    | {"type": "server", "timestamp": "2022-06-21T11:13:09,268Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.2", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.2}{RbsAVp_2RDi0_Mnp1prHWw}{4zOIc6W6TVuWCAp_1Q9vaA}{10.0.0.218}{10.0.0.218:9300}{cdfhilmrstw}]; discovery will continue using [10.0.21.20:9300, 10.0.21.21:9300] from hosts providers and [{elastic.2}{RbsAVp_2RDi0_Mnp1prHWw}{4zOIc6W6TVuWCAp_1Q9vaA}{10.0.0.218}{10.0.0.218:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
join_elastic.1.643zzeau0zip@desktop    | {"type": "server", "timestamp": "2022-06-21T11:13:09,561Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "docker-cluster", "node.name": "elastic.1", "message": "master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.1}{Qt1rVkh1RgStJhAl4qbjdg}{AYhXpnRZQxCuH-Bdr-sCkg}{10.0.0.217}{10.0.0.217:9300}{cdfhilmrstw}]; discovery will continue using [10.0.21.20:9300, 10.0.21.21:9300] from hosts providers and [{elastic.1}{Qt1rVkh1RgStJhAl4qbjdg}{AYhXpnRZQxCuH-Bdr-sCkg}{10.0.0.217}{10.0.0.217:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }

Moving to Elasticsearch 8.2.3 I get the same messages as well as:

join_elastic.1.u0uhzkvgmfwy@desktop    | {"@timestamp":"2022-06-21T11:20:39.797Z", "log.level": "WARN", "message":"completed handshake with [{elastic.2}{JSQSuvhuRreDP37uOpaP4A}{Tq3s9YwdTEaGMepnxao9Vw}{10.0.0.240}{10.0.0.240:9300}{cdfhilmrstw}] at [10.0.23.29:9300] but followup connection to [10.0.0.240:9300] failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.1][generic][T#1]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"elastic.1","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elastic.2][10.0.0.240:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elastic.2][10.0.0.240:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1112)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:714)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
join_elastic.2.oalt2w13eyi9@desktop    | {"@timestamp":"2022-06-21T11:20:39.807Z", "log.level": "WARN", "message":"completed handshake with [{elastic.1}{5-1-Y3yORxqxBMMjUiNlyg}{y-2UMAGfQ7aLrZ-4nqBMCg}{10.0.0.242}{10.0.0.242:9300}{cdfhilmrstw}] at [10.0.23.31:9300] but followup connection to [10.0.0.242:9300] failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.2][generic][T#1]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"elastic.2","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elastic.1][10.0.0.242:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elastic.1][10.0.0.242:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1112)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:714)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
join_elastic.3.svvkok858ld1@desktop    | {"@timestamp":"2022-06-21T11:20:40.496Z", "log.level": "WARN", "message":"completed handshake with [{elastic.2}{JSQSuvhuRreDP37uOpaP4A}{Tq3s9YwdTEaGMepnxao9Vw}{10.0.0.240}{10.0.0.240:9300}{cdfhilmrstw}] at [10.0.23.29:9300] but followup connection to [10.0.0.240:9300] failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.3][generic][T#2]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"elastic.3","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elastic.2][10.0.0.240:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elastic.2][10.0.0.240:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1112)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:714)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
join_elastic.3.svvkok858ld1@desktop    | {"@timestamp":"2022-06-21T11:20:40.496Z", "log.level": "WARN", "message":"completed handshake with [{elastic.1}{5-1-Y3yORxqxBMMjUiNlyg}{y-2UMAGfQ7aLrZ-4nqBMCg}{10.0.0.242}{10.0.0.242:9300}{cdfhilmrstw}] at [10.0.23.31:9300] but followup connection to [10.0.0.242:9300] failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.3][generic][T#3]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"elastic.3","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elastic.1][10.0.0.242:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elastic.1][10.0.0.242:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1112)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:714)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
join_elastic.3.svvkok858ld1@desktop    | {"@timestamp":"2022-06-21T11:20:47.478Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.3}{SWAdmCU8THyAjXLkahmfSA}{VmN3WLXqRW2f0T2oerSqpg}{10.0.0.241}{10.0.0.241:9300}{cdfhilmrstw}]; discovery will continue using [10.0.23.31:9300, 10.0.23.29:9300] from hosts providers and [{elastic.3}{SWAdmCU8THyAjXLkahmfSA}{VmN3WLXqRW2f0T2oerSqpg}{10.0.0.241}{10.0.0.241:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.3][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"elastic.3","elasticsearch.cluster.name":"docker-cluster"}
join_elastic.1.u0uhzkvgmfwy@desktop    | {"@timestamp":"2022-06-21T11:20:48.785Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.1}{5-1-Y3yORxqxBMMjUiNlyg}{y-2UMAGfQ7aLrZ-4nqBMCg}{10.0.0.242}{10.0.0.242:9300}{cdfhilmrstw}]; discovery will continue using [10.0.23.31:9300, 10.0.23.29:9300] from hosts providers and [{elastic.1}{5-1-Y3yORxqxBMMjUiNlyg}{y-2UMAGfQ7aLrZ-4nqBMCg}{10.0.0.242}{10.0.0.242:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.1][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"elastic.1","elasticsearch.cluster.name":"docker-cluster"}
join_elastic.2.oalt2w13eyi9@desktop    | {"@timestamp":"2022-06-21T11:20:48.794Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [elastic.1, elastic.2, elastic.3] to bootstrap a cluster: have discovered [{elastic.2}{JSQSuvhuRreDP37uOpaP4A}{Tq3s9YwdTEaGMepnxao9Vw}{10.0.0.240}{10.0.0.240:9300}{cdfhilmrstw}]; discovery will continue using [10.0.23.31:9300, 10.0.23.29:9300] from hosts providers and [{elastic.2}{JSQSuvhuRreDP37uOpaP4A}{Tq3s9YwdTEaGMepnxao9Vw}{10.0.0.240}{10.0.0.240:9300}{cdfhilmrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.2][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"elastic.2","elasticsearch.cluster.name":"docker-cluster"}

Yes, that's telling you the problem now:

completed handshake with [{elastic.2}{JSQSuvhuRreDP37uOpaP4A}{Tq3s9YwdTEaGMepnxao9Vw}{10.0.0.240}{10.0.0.240:9300}{cdfhilmrstw}] at [10.0.23.29:9300] but followup connection to [10.0.0.240:9300] failed

Note the mismatched addresses.

10.0.23.29:9300 vs 10.0.0.240:9300 ?

They should have the same address? How do I correct this inside the docker compose configuration?

Not sure, sorry, this is more of a Docker question than anything to do with Elasticsearch. All the nodes need to have the same view of the network.

I'm not really familiar enough with the elasticsearch internals.

It looks like the nodes have these IPs:
elastic.1 10.0.0.242
elastic.2 10.0.0.240
elastic.3 10.0.0.241

elastic.1 successfully completes a handshake with elastic.2 at 10.0.23.29:9300, which for some reason is different from elastic.2's own IP, 10.0.0.240.

It then tries to connect to 10.0.0.240:9300, but that connection fails.

I don't understand where 10.0.23.29 is coming from.

It's coming from the DNS lookups here:

OK, so elastic.3 completes a handshake with elastic.1 at 10.0.36.3:9300 and then tries to connect to 10.0.0.118:9300.

join_elastic.3.l9xy61pb16ga@desktop    | {"@timestamp":"2022-06-21T13:16:05.833Z", "log.level": "WARN", "message":"completed handshake with [{elastic.1}{dgKS-oRlRiqC2yp3c9_1jQ}{CzsGubIKRYOlQVq4KybQNw}{10.0.0.118}{10.0.0.118:9300}{cdfhilmrstw}] at [10.0.36.3:9300] but followup connection to [10.0.0.118:9300] failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elastic.3][generic][T#4]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"elastic.3","elasticsearch.cluster.name":"docker-cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[elastic.1][10.0.0.118:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [elastic.1][10.0.0.118:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1112)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:714)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}

If I inspect elastic.1 I see the following networks:

"Networks": {
    "ingress": {
        "IPAMConfig": {
            "IPv4Address": "10.0.0.118"
        },
        "Links": null,
        "Aliases": [
            "33bd3ce845ae",
            "elastic.1"
        ],
        "NetworkID": "u5xapq975n7vbqxt4i0t7sjg3",
        "EndpointID": "fc43b9cf6308c84c859e34998c8f344364b40bc59afa2cc910693782271bbcda",
        "Gateway": "",
        "IPAddress": "10.0.0.118",
        "IPPrefixLen": 24,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:0a:00:00:76",
        "DriverOpts": null
    },
    "default": {
        "IPAMConfig": {
            "IPv4Address": "10.0.36.3"
        },
        "Links": null,
        "Aliases": [
            "33bd3ce845ae",
            "elastic.1"
        ],
        "NetworkID": "v0ozxus7tcy0mrzy84adohkpo",
        "EndpointID": "c1b8ee98d0355387e6b0d93a25b14bc937fbb95cc22a6305c972d4fca8cf6367",
        "Gateway": "",
        "IPAddress": "10.0.36.3",
        "IPPrefixLen": 24,
        "IPv6Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "MacAddress": "02:42:0a:00:24:03",
        "DriverOpts": null
    }
}

If I'm reading this correctly, the handshake goes over the default network (10.0.36.3) and the follow-up connection is then attempted over the ingress network (10.0.0.118).

Should both the handshake and the connection not go over the default network?

My hypothesis was correct. Adding the option:

    environment:
      - network.host=_eth1_

excludes the ingress network and forces elasticsearch to use the default network.
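
For context, here is a minimal sketch of how that option might sit in the service definition from the top of the thread. It is trimmed for brevity and keeps the original values. The interface name `eth1` is what worked in this particular setup. It is an assumption that it maps to the default overlay network elsewhere, so check the interface names inside your own containers first.

```yaml
  elastic:
    image: elasticsearch:8.2.3
    hostname: elastic.{{.Task.Slot}}
    environment:
      - node.name=elastic.{{.Task.Slot}}
      - cluster.name=docker-cluster
      - cluster.initial_master_nodes=elastic.1,elastic.2,elastic.3
      - discovery.seed_hosts=elastic.1,elastic.2
      # Bind and publish on the interface attached to the stack's default
      # overlay network rather than the ingress network.
      # Assumption: eth1 is that interface, as it was in this thread.
      - network.host=_eth1_
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    deploy:
      replicas: 3
```

With this in place, the address each node publishes should be on the same network as the addresses that the seed host names resolve to.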

Any other cluster software I'm using picks the correct interface or default network. I don't understand why elasticsearch tries to use both interfaces??? Is this a bug?

I don't think we'd consider it an Elasticsearch bug. It's important that all the nodes have the same view of the network, but that's something we expect you to arrange.

But why does my mongodb cluster or redis cluster or my webserver for that matter always connect over the default network when resolving host names?

Why does Elasticsearch complete a handshake over the default network (eth1) and then try to connect over the ingress network (eth0)?

This isn't very consistent.

If you run Elasticsearch in an environment that has multiple network interfaces, but don't specify which one you want it to use, then it assumes you don't care which one it should use (i.e. it expects them all to have equivalent connectivity) and just picks one. Your environment isn't set up like that.
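
As a rough sketch of what specifying the interface can look like, the bind and publish addresses can also be set separately. `network.bind_host` and `network.publish_host` are standard Elasticsearch settings. The interface name `_eth1_` is taken from the fix found earlier in this thread and is an assumption about this particular environment.

```yaml
    environment:
      # Listen on every interface inside the container.
      - network.bind_host=0.0.0.0
      # Advertise only the address on the stack's default overlay network,
      # so other nodes are told to connect to an address they can reach.
      # Assumption: eth1 is that interface; verify it in your containers.
      - network.publish_host=_eth1_
```

The publish address is the one a node reports to its peers during the handshake, so it is the setting that has to match the network the other nodes can actually reach.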
