I'm pulling my hair out trying to figure out why my three Elasticsearch nodes, running as docker-compose containers on three separate Ubuntu 22.04 LTS hosts, cannot form a cluster.
The three nodes discover each other but fail to elect a master. When I curl a node to get the list of nodes, I get a "master_not_discovered_exception".
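For anyone who wants to reproduce the check without curl, here is a minimal Python sketch of the same request. It assumes the list of nodes is fetched from the `_cat/nodes` endpoint on the host-mapped HTTP port (9207 in the compose file further down); the helper names `nodes_url` and `list_nodes` are made up for illustration.

```python
import urllib.error
import urllib.request


def nodes_url(host, http_port):
    """Build the URL of the _cat/nodes endpoint (with column headers)."""
    return f"http://{host}:{http_port}/_cat/nodes?v"


def list_nodes(host, http_port, timeout=5):
    """Return (status, body) for GET /_cat/nodes?v on one node."""
    try:
        with urllib.request.urlopen(nodes_url(host, http_port), timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # With no elected master the node is expected to answer with an
        # HTTP error status and a JSON body naming the exception.
        return err.code, err.read().decode()


# Example call (needs a reachable node, so it is left commented out):
# status, body = list_nodes("172.16.0.152", 9207)
# print(status, body)
```

Returning the error body instead of raising keeps the exception type in the response visible, which is the interesting part here.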
Here are the repeating log statements on the 3 nodes:
node01-master    | {"type": "server", "timestamp": "2023-08-04T15:54:14,467Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-cluster-test-73", "node.name": "node01-master", "message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [XSnNy2f2QOmu9ZrAbEqeGA, TWCTna4GRzSk4NQmYLe8aw, CrOUreENRAKBvUHmMSqIlA], have only discovered non-quorum [{node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}, {node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}, {node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}]; discovery will continue using [172.16.0.149:9307, 172.16.0.148:9307] from hosts providers and [{node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}] from last-known cluster state; node term 2, last-accepted version 101 in term 2" }
As you can see, only one of the three listed IDs (node01-master's own) has been discovered, while an election requires at least two of them.
node02-data    | {"type": "server", "timestamp": "2023-08-04T15:54:25,534Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-cluster-test-73", "node.name": "node02-data", "message": "master not discovered or elected yet, an election requires two nodes with ids [PjWpoi54RIuHbTbNN2eqzQ, CrOUreENRAKBvUHmMSqIlA], have discovered possible quorum [{node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}, {node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}, {node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}]; discovery will continue using [172.16.0.152:9307, 172.16.0.148:9307] from hosts providers and [{node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
As you can see, both of the two expected IDs have been discovered.
node03-data    | {"type": "server", "timestamp": "2023-08-04T15:54:35,075Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-cluster-test-73", "node.name": "node03-data", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [gcWZXovdQGST_aVd1sedhQ, CrOUreENRAKBvUHmMSqIlA], have discovered possible quorum [{node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}, {node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}, {node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}]; discovery will continue using [172.16.0.152:9307, 172.16.0.149:9307] from hosts providers and [{node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
As you can see, again, both of the two expected IDs have been discovered.
I don't know how important the last part of each log message is, the one about the "term" and the "last-accepted version" of the cluster state. The first node keeps reporting the same values for those (term 2, last-accepted version 101). The other two report 0 for both.
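The ID comparison in the log messages above can be double-checked mechanically. The following is a rough Python sketch, not an Elasticsearch tool: the function name `summarize_election` is made up, the regular expressions assume the message layout shown in these logs, and the two sample messages are copies of the node01-master and node02-data messages cut off after "discovery will continue".

```python
import re

# Copies of the messages above, truncated after "discovery will continue".
NODE01_MSG = (
    "master not discovered or elected yet, an election requires at least 2 nodes with ids from "
    "[XSnNy2f2QOmu9ZrAbEqeGA, TWCTna4GRzSk4NQmYLe8aw, CrOUreENRAKBvUHmMSqIlA], "
    "have only discovered non-quorum ["
    "{node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}, "
    "{node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}, "
    "{node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}]"
    "; discovery will continue"
)
NODE02_MSG = (
    "master not discovered or elected yet, an election requires two nodes with ids "
    "[PjWpoi54RIuHbTbNN2eqzQ, CrOUreENRAKBvUHmMSqIlA], "
    "have discovered possible quorum ["
    "{node02-data}{PjWpoi54RIuHbTbNN2eqzQ}{NnQM-vFpQd6nC-BQa5QFGg}{172.16.0.149}{172.16.0.149:9307}{cdfhimrstw}, "
    "{node01-master}{CrOUreENRAKBvUHmMSqIlA}{ODyLuSnYSNusYwdWUa-tFg}{172.16.0.152}{172.16.0.152:9307}{cdfhimrstw}, "
    "{node03-data}{gcWZXovdQGST_aVd1sedhQ}{YYN3f3gbTcS0_mDrfoasGg}{172.16.0.148}{172.16.0.148:9307}{cdfhimrstw}]"
    "; discovery will continue"
)


def summarize_election(message):
    """Return (required_ids, discovered, matched_ids) for one log message.

    required_ids: node IDs the election needs votes from
    discovered:   {node_id: node_name} for every discovered node
    matched_ids:  required IDs that were actually discovered
    """
    required = re.search(r"with ids (?:from )?\[([^\]]*)\]", message)
    required_ids = [part.strip() for part in required.group(1).split(",")]

    found = re.search(
        r"discovered (?:non-quorum|possible quorum) \[(.*?)\]; discovery", message
    )
    # Each node is printed as {name}{id} followed by four more {...} fields.
    pairs = re.findall(r"\{([^{}]+)\}\{([^{}]+)\}(?:\{[^{}]*\}){4}", found.group(1))
    discovered = {node_id: name for name, node_id in pairs}

    matched_ids = [node_id for node_id in required_ids if node_id in discovered]
    return required_ids, discovered, matched_ids


for label, msg in (("node01-master", NODE01_MSG), ("node02-data", NODE02_MSG)):
    required_ids, discovered, matched_ids = summarize_election(msg)
    print(label, "needs votes from", required_ids)
    print("  discovered among those:", [discovered[i] for i in matched_ids])
```

For the two messages above it reports a single matching ID on node01-master (its own) and both required IDs on node02-data. The other two IDs that node01-master is waiting for do not belong to any node it has discovered.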
Here is the docker-compose.yaml for the 1st node (the others are of course very similar). The IP addresses you see belong to the docker hosts.
version: '3.7'
services:
  node01-master:
    build: .
    container_name: node01-master
    hostname: es-master
    environment:
      - node.name=node01-master
      - cluster.name=es-cluster-test-73
      - discovery.seed_hosts=172.16.0.152:9307,172.16.0.149:9307,172.16.0.148:9307
      - cluster.initial_master_nodes=node01-master,node02-data,node03-data
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms256M -Xmx256M"
      - node.master=true
      - node.voting_only=false
      - node.data=true
      - node.ingest=true
      - node.ml=false
      - xpack.ml.enabled=true
      - cluster.remote.connect=true
      - network.publish_host=172.16.0.152
      - transport.publish_port=9307
      - http.publish_port=9307
    volumes:
      - data01:/usr/share/elasticsearch/data
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - "9207:9200"
      - "9307:9300"
    networks:
      - elastic_73
    restart: always
volumes:
  data01:
    driver: local
networks:
  elastic_73:
    driver: bridge
    ipam:
      driver: default
      config:
      - subnet: 10.40.4.1/24
I'd be grateful for any help!
- George