Master not discovered/elected after making changes to docker-compose.yml

Hello,

I've got a 3-node ES cluster running alongside Kibana, all defined in a single docker-compose file. My compose file is below:

    version: '2.2'
    services:
      es01:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
        container_name: es01
        environment:
          - node.name=es01
          - cluster.name=es-docker-cluster
          - discovery.seed_hosts=es02,es03
          - cluster.initial_master_nodes=es01,es02,es03
          - "ES_JAVA_OPTS=-Xms7g -Xmx7g"
        volumes:
          - data01:/usr/share/elasticsearch/data
        ports:
          - "127.0.0.1:9200:9200"
        networks:
          - esnet
      es02:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
        container_name: es02
        environment:
          - node.name=es02
          - cluster.name=es-docker-cluster
          - discovery.seed_hosts=es01,es03
          - cluster.initial_master_nodes=es01,es02,es03
          - "ES_JAVA_OPTS=-Xms7g -Xmx7g"
        volumes:
          - data02:/usr/share/elasticsearch/data
        networks:
          - esnet
      es03:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
        container_name: es03
        environment:
          - node.name=es03
          - cluster.name=es-docker-cluster
          - discovery.seed_hosts=es01,es02
          - cluster.initial_master_nodes=es01,es02,es03
          - "ES_JAVA_OPTS=-Xms7g -Xmx7g"
        volumes:
          - data03:/usr/share/elasticsearch/data
        networks:
          - esnet
      kibana:
        image: docker.elastic.co/kibana/kibana:7.5.0
        ports:
          - "5601:5601"
        depends_on:
          - es03
        environment:
          - ELASTICSEARCH_HOSTS=http://es01:9200
        networks:
          - esnet

    volumes:
      data01:
        driver: local
      data02:
        driver: local
      data03:
        driver: local
    
    networks:
      esnet:

This setup was running perfectly well for a couple of months until I had to make some changes (adding ulimit settings to prevent swapping) to docker-compose.yml. Before making the changes, I brought down the containers using docker-compose down. I then added the necessary lines and brought the containers back up.

Since bringing them back up, I've been getting errors from each of the Elasticsearch nodes complaining about an inability to elect a master. The output is as follows:

    es01      | {"type": "server", "timestamp": "2020-07-16T13:50:01,439Z", "level": "WARN", "component": "o.e.c.c.ClusterFor
    mationFailureHelper", "cluster.name": "es-docker-cluster", "node.name": "es01", "message": "master not discovered or elec
    ted yet, an election requires at least 2 nodes with ids from [FpSZvU6nRl-2KYRyZJ1i2Q, qrujmnYwS5G9HdZ4SIlweA, XdKCB9MvSr-
    8FZMWHpzScg], have discovered [{es01}{qrujmnYwS5G9HdZ4SIlweA}{o2wyby9DSjyBHA-bwtcalg}{192.168.208.2}{192.168.208.2:9300}{
    dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_jobs=20}] which is not a quorum; discovery will co
    ntinue using [192.168.208.3:9300, 192.168.208.4:9300] from hosts providers and [{es01}{qrujmnYwS5G9HdZ4SIlweA}{o2wyby9DSj
    yBHA-bwtcalg}{192.168.208.2}{192.168.208.2:9300}{dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_j
    obs=20}] from last-known cluster state; node term 49, last-accepted version 2490 in term 49" }
    es02      | {"type": "server", "timestamp": "2020-07-16T13:50:04,900Z", "level": "WARN", "component": "o.e.c.c.ClusterFor
    mationFailureHelper", "cluster.name": "es-docker-cluster", "node.name": "es02", "message": "master not discovered or elec
    ted yet, an election requires at least 2 nodes with ids from [FpSZvU6nRl-2KYRyZJ1i2Q, qrujmnYwS5G9HdZ4SIlweA, XdKCB9MvSr-
    8FZMWHpzScg], have discovered [{es02}{XdKCB9MvSr-8FZMWHpzScg}{-t78nWdqQHOanziElOv1fg}{192.168.208.3}{192.168.208.3:9300}{
    dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_jobs=20}] which is not a quorum; discovery will co
    ntinue using [192.168.208.2:9300, 192.168.208.4:9300] from hosts providers and [{es02}{XdKCB9MvSr-8FZMWHpzScg}{-t78nWdqQH
    OanziElOv1fg}{192.168.208.3}{192.168.208.3:9300}{dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_j
    obs=20}] from last-known cluster state; node term 49, last-accepted version 2491 in term 49" }
    es03      | {"type": "server", "timestamp": "2020-07-16T13:50:06,281Z", "level": "WARN", "component": "o.e.c.c.ClusterFor
    mationFailureHelper", "cluster.name": "es-docker-cluster", "node.name": "es03", "message": "master not discovered or elec
    ted yet, an election requires at least 2 nodes with ids from [FpSZvU6nRl-2KYRyZJ1i2Q, qrujmnYwS5G9HdZ4SIlweA, XdKCB9MvSr-
    8FZMWHpzScg], have discovered [{es03}{FpSZvU6nRl-2KYRyZJ1i2Q}{8TvEjBy2SMuNagcP0avCOw}{192.168.208.4}{192.168.208.4:9300}{
    dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_jobs=20}] which is not a quorum; discovery will co
    ntinue using [192.168.208.2:9300, 192.168.208.3:9300] from hosts providers and [{es03}{FpSZvU6nRl-2KYRyZJ1i2Q}{8TvEjBy2SM
    uNagcP0avCOw}{192.168.208.4}{192.168.208.4:9300}{dilm}{ml.machine_memory=33687420928, xpack.installed=true, ml.max_open_j
    obs=20}] from last-known cluster state; node term 49, last-accepted version 2491 in term 49" }

I reverted the compose file to the previously working version, seen above. After bringing the containers back up, I'm still getting the same error. I suspect part of the problem relates to the persisted volumes for each of the containers: is it possible that the cluster retained configuration from before, but some of that information changed when the containers were rebuilt? I need to retain the data from the ES cluster and Kibana.

To troubleshoot, I've tried spawning a shell in each of the ES nodes and curling the other nodes in the cluster. Curling by the node names specified in the docker-compose file on the ES ports (9200/9300) fails. I can, however, ping each node's IP and get a response, so it seems like the nodes can actually communicate.
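The checks looked roughly like this from inside es01 (es02's address taken from the logs above):

    docker exec -it es01 /bin/bash
    # then, inside the container:
    curl http://es02:9200        # fails when using the service name
    ping -c 1 192.168.208.3      # succeeds when using the container IP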

Happy to provide more information if necessary.

Each node reports that it can only see itself, which looks like a connectivity issue to me.

Did you try the corresponding IP addresses? Elasticsearch is trying things like 192.168.208.2:9300, so curl http://192.168.208.2:9300/ should return "This is not an HTTP port" on every node. If it doesn't, then it's definitely a connectivity issue.
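For example, from a shell inside es01 (IPs taken from your logs):

    curl http://192.168.208.3:9300/
    curl http://192.168.208.4:9300/
    # a reachable transport port answers each request with:
    # This is not an HTTP port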

Have you checked the IPs the containers are using? Also, why is ports only specified on es01 and not on es02/es03? And I think "127.0.0.1:9200:9200" will only listen on localhost, i.e. other hosts won't be able to reach this container, I believe.

Suggest using telnet on each node to ensure you can reach the IP/port of each of the other nodes using the IPs the cluster expects (192.168.208.2/3/4) - until that works, nothing else will. You might also check on each node that you can reach 127.0.0.1:9300 (though this can get weird in Docker; often the node IP is better).
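For example, from inside es01 (again using the IPs from your logs; if telnet isn't in the image, netcat does the same job):

    telnet 192.168.208.3 9300
    telnet 192.168.208.4 9300
    # or with netcat (-v verbose, -z only test the connection):
    nc -vz 192.168.208.3 9300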

Note: I'm not sure what "adding ulimit settings to prevent swapping" means, as ulimit is not directly related to swapping.

Thanks for the responses.

I've commented out the lines defining the network and ports in the compose file - somehow that's sorted everything. I must have defined the network incorrectly to begin with. The cluster forms if I just let Docker take care of the networking. Kibana is also now able to communicate with ES and I can query my index data.
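For anyone who finds this later, the change amounted to commenting out the port mapping and the esnet references, roughly like this per service (a sketch, not the full file; the top-level networks: block was commented out as well):

    es01:
      image: docker.elastic.co/elasticsearch/elasticsearch:7.5.0
      # (environment and volumes unchanged)
      # ports:
      #   - "127.0.0.1:9200:9200"
      # networks:
      #   - esnet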


Re: ulimit and swapping, I was referring to setting the memlock ulimit to unlimited. I'd read that this prevents the JVM heap from being swapped to disk, and it appears to do so. If this is not recommended for some reason, please let me know.
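The lines I added were along these lines, per ES service:

    es01:
      environment:
        - bootstrap.memory_lock=true   # ask ES to lock its heap in RAM
      ulimits:
        memlock:
          soft: -1                     # -1 means unlimited lockable memory
          hard: -1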

The ulimit part confused me a bit, as it kinda mixes up "unlimited" and "ulimit", which are not the same thing but are related.

As you may know, there are three different things:

  • memlock - yes, this prevents swapping, though it's often simplest to just have no swap on the host at all (the default kernel swappiness causes nightmares with this). It's usually on/off, and is set as part of the Elasticsearch config (bootstrap.memory_lock), as ES makes this call on startup.

  • memlock unlimited - this is the size of lockable memory, which you can set to unlimited so ES can lock its whole heap. It is set as part of ulimits, but is not that common.

  • ulimit - these are user limits, and usually relate to open files; the default is often not large enough, so 64K is recommended. But ulimits are also used to set memlock to unlimited, which is what makes this confusing.

So when I saw "ulimit" and "swapping" together, it looked like a mix-up to me, though you can see they are related. For most people, though, ulimit is about open files.
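If it helps, you can check what limits a running container actually ended up with:

    # show all user limits as seen inside the container
    docker exec es01 bash -c 'ulimit -a'
    # "max locked memory" is the memlock limit (unlimited if you want ES to lock its heap)
    # "open files" is the nofile limit (64K or more is the usual recommendation)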

@matt-m
I am assuming you are using Docker's default networking (the bridge network).

If you do not add containers to a user-defined network like esnet, containers can reference each other only by IP, not by name like es01. So seed_hosts would require IPs, which cannot be predicted beforehand.
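You can demonstrate the difference with throwaway containers (the names here are just examples):

    # default bridge network: containers cannot resolve each other by name
    docker run -d --name test-a alpine sleep 300
    docker run --rm alpine ping -c 1 test-a                       # "bad address 'test-a'"

    # user-defined network: Docker's embedded DNS resolves container names
    docker network create test-net
    docker network connect test-net test-a
    docker run --rm --network test-net alpine ping -c 1 test-a    # works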

If you only needed a one-time setup, you can ignore this post.

If you want a repeatable solution, you may need to investigate further. Try running this docker-compose file on another machine.

The following commands may help debug:

    # list networks (ignore the default ones: bridge, host, none)
    docker network ls

    # list the networks this container is connected to (output is JSON)
    docker inspect es01 --format '{{json .NetworkSettings.Networks}}'
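If jq is installed on the host, the JSON is easier to read piped through it:

    docker inspect es01 --format '{{json .NetworkSettings.Networks}}' | jq .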
