Docker swarm deployment: master_not_discovered_exception

I am basing my deployment on the docker-compose.yml file found here. My modified yml file is pasted below.

The Docker swarm gets all three services running (only one replicated container per service), but the logs show exceptions such as

"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es03]

An attempt to get the cluster status returns a result such as the following:

curl -XGET http://helium:9200/_cat/allocation?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

To diagnose this I opened a shell in each of the containers and confirmed that they can communicate with one another. A curl GET to port 9200 of any of the containers looks normal, and port 9300 looks like this:

[root@39043ca9d9dd elasticsearch]# curl -XGET http://es02:9300
This is not an HTTP port

This indicates to me that connectivity is there at the container level via the Docker overlay network, but Elasticsearch itself doesn't seem to be able to use that connection.

Any suggestions?

version: '3.0'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - nfs-es01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - nfs-es02:/usr/share/elasticsearch/data
    networks:
      - elastic
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.5.2
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - nfs-es03:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  nfs-es01:
    driver_opts:
      type: nfs
      o: addr=10.2.0.1,rw,nfsvers=4,local_lock=all
      device: :/sbn/process3/elasticsearch01
  nfs-es02:
    driver_opts:
      type: nfs
      o: addr=10.2.0.1,rw,nfsvers=4,local_lock=all
      device: :/sbn/process3/elasticsearch02
  nfs-es03:
    driver_opts:
      type: nfs
      o: addr=10.2.0.1,rw,nfsvers=4,local_lock=all
      device: :/sbn/process3/elasticsearch03

networks:
  elastic:
    external: true
    driver: overlay

The logs typically have the info needed to work out what's wrong, so my suggestion would be to look there. Share them here if you need help understanding them.

David, I did post the relevant line from the log. I don't think the full stack trace is really going to help, but just in case it does, I have a larger segment pasted below.

I haven't read the source code, but it is pretty clear that the server es01 attempted to establish a connection to es03 and failed. The logs contain similar reports for all combinations of es01, es02, and es03. The connection attempt times out.

I suspect that somehow Elasticsearch is not resolving the DNS names correctly, even though resolution from the command line works. I can't think of why that would be.

p3es_es01.1.t417ccm7zh6h@boron    | {"type": "server", "timestamp": "2021-11-27T21:15:09,752Z", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "es-docker-cluster", "node.name": "es01", "message": "failed to join {es03}{akrtWEgbR6eI89b-X7mzmw}{-BuBEtuXRTyUPxK7r4GvJA}{10.0.3.189}{10.0.3.189:9300}{dilm}{ml.machine_memory=33668845568, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es01}{_r72fbKhTRGsStILTChHhQ}{VF-VL3LMTrK9_cqXx4g6mw}{10.0.0.35}{10.0.0.35:9300}{dilm}{ml.machine_memory=101314478080, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=11, lastAcceptedTerm=2, lastAcceptedVersion=19, sourceNode={es01}{_r72fbKhTRGsStILTChHhQ}{VF-VL3LMTrK9_cqXx4g6mw}{10.0.0.35}{10.0.0.35:9300}{dilm}{ml.machine_memory=101314478080, xpack.installed=true, ml.max_open_jobs=20}, targetNode={es03}{akrtWEgbR6eI89b-X7mzmw}{-BuBEtuXRTyUPxK7r4GvJA}{10.0.3.189}{10.0.3.189:9300}{dilm}{ml.machine_memory=33668845568, ml.max_open_jobs=20, xpack.installed=true}}]}", 
p3es_es01.1.t417ccm7zh6h@boron    | "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es03][10.0.3.189:9300][internal:cluster/coordination/join]",
p3es_es01.1.t417ccm7zh6h@boron    | "Caused by: org.elasticsearch.transport.ConnectTransportException: [es01][10.0.0.35:9300] connect_exception",
p3es_es01.1.t417ccm7zh6h@boron    | "at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:989) ~[elasticsearch-7.5.2.jar:7.5.2]",
p3es_es01.1.t417ccm7zh6h@boron    | "at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$3(ActionListener.java:162) ~[elasticsearch-7.5.2.jar:7.5.2]",
p3es_es01.1.t417ccm7zh6h@boron    | "at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.5.2.jar:7.5.2]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2159) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-7.5.2.jar:7.5.2]",
p3es_es01.1.t417ccm7zh6h@boron    | "at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:68) ~[transport-netty4-client-7.5.2.jar:7.5.2]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:263) ~[netty-transport-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.lang.Thread.run(Thread.java:830) [?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "Caused by: java.io.IOException: connection timed out: 10.0.0.35/10.0.0.35:9300",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) ~[netty-common-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
p3es_es01.1.t417ccm7zh6h@boron    | "at java.lang.Thread.run(Thread.java:830) ~[?:?]"] }

This won't be the only relevant message in the logs, although it's already more useful than the fragment you originally shared. You're using 7.5.2, which is very old and long past EOL, and yes, it's a connection timeout. Try a recent version (they have better logging, if nothing else) and then please share a more complete set of logs.
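In your compose file that just means bumping the image tag on each of the three services, e.g. to a later 7.x release such as 7.16.2 (check for whatever is current):

    image: docker.elastic.co/elasticsearch/elasticsearch:7.16.2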

I upgraded Elasticsearch -- no difference.

However, I seem to have come up with a partial solution.

The issue goes away when I remove the publication of port 9200 on the first service es01:

    ports:
      - 9200:9200

What seems to happen is that including this directive causes docker stack to attach an additional network to each of the containers. Somehow Elasticsearch gets confused about which IP address to communicate on. This is shown in the logs above by this line:

p3es_es01.1.t417ccm7zh6h@boron    | "Caused by: org.elasticsearch.transport.ConnectTransportException: [es01][10.0.0.35:9300] connect_exception"

The service es01 is not located at 10.0.0.35. Attempting to initiate a connection to port 9300 (or 9200) at that address is rejected.

I can't tell if this is an Elasticsearch issue or a Docker issue.
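
If someone does need 9200 published to the outside, a Docker-side workaround that I believe avoids the extra ingress attachment is to publish the port in host mode using the long port syntax. This is a sketch only -- I haven't tested it on my setup, and it requires compose file format 3.2 or later:

    ports:
      - target: 9200      # port inside the container
        published: 9200   # port bound directly on the swarm node running the task
        protocol: tcp
        mode: host        # bypass the routing mesh, so no ingress network is needed for this port

The trade-off is that 9200 is then only reachable on the node where the es01 task happens to be scheduled.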

For me, not publishing port 9200 is fine because I always intended to confine that traffic to the private overlay network anyway for security. However, this might not be OK for someone else attempting to do this, so it might be worth someone at Elastic coming up with a more robust solution. It might be as simple as a configuration option to direct Elasticsearch to use the correct IP address, in which case the fix is just a documentation update.
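
As for a configuration option: Elasticsearch does have network.publish_host, which controls the address a node advertises to the rest of the cluster. Something along these lines might have worked instead of dropping the published port -- a sketch only, since which interface name corresponds to the overlay network is an assumption that would need to be checked with ip addr inside the container:

    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      # Advertise the overlay-network address for transport traffic rather than
      # letting Elasticsearch pick the ingress address. The interface name here
      # is a guess; it may be eth1 or eth2 depending on how the networks attach.
      - network.publish_host=_eth0_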

Thanks for the help, and hopefully this thread will save someone else the trouble.

