Master Nodes not communicating properly in AWS

I am running Elasticsearch on HashiCorp Nomad in AWS, with Docker images based on the official ones (docker.elastic.co/elasticsearch/elasticsearch:7.2.0).

The security groups allow communication properly, as I can verify with netcat inside the container once deployed.

I'm getting a lot of NodeNotConnectedExceptions, and I'm not sure what I can do about it. About one deploy in ten runs perfectly, until I redeploy and it fails with a similar log again.

If I don't get NodeNotConnectedExceptions, I get CoordinationStateRejectedExceptions or instances seemingly fighting over the role of master: https://pastebin.com/pFmKrLtq (note I tried this with 7.0.0, but experienced it with 7.2.0 as well)

My Dockerfile:

ARG ELK_VERSION
FROM docker.elastic.co/elasticsearch/elasticsearch:${ELK_VERSION}
COPY config/elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml
ENV ES_JAVA_OPTS="-Xmx2g -Xms2g"
RUN echo "vm.max_map_count = 262144" > /etc/sysctl.conf

My configuration:

network:
  host: "0.0.0.0"
  bind_host: "0.0.0.0"

bootstrap.memory_lock: true

cluster:
  name: "clevyr-elk-cluster"
  initial_master_nodes:
    - elasticsearch-master-0
    - elasticsearch-master-1
    - elasticsearch-master-2

discovery:
  seed_providers: settings
  seed_hosts:
    - 172.31.29.199
    - 172.31.33.10
    - 172.31.11.195

node:
  max_local_storage_nodes: 3

node.name is coming from Nomad, where NOMAD_GROUP_NAME is elasticsearch-master and NOMAD_ALLOC_INDEX is from 0 to 2. network.publish_host is set to the private IP of the EC2 instances:

node.name=${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_INDEX}
network.publish_host=${NOMAD_IP_rest}

Log messages from a clean run with no data: https://pastebin.com/bAud1hWd

Note the complaints that 172.31.11.195:9300 is not connected. Yet from this instance, inside the container, I can run:

nc -v 172.31.11.195 9300

and it connects successfully.

Your log, specifically the NodeNotConnectedExceptions, indicates that the nodes are able to connect to each other, but something outside of Elasticsearch is then breaking these connections after a few messages. Often this is caused by an overenthusiastic intrusion detection system (IDS).
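
A successful nc connect only shows that the port is reachable, not that an established connection stays up. As a rough way to look at the second question, here is a minimal sketch in Python that opens a TCP connection, holds it idle, and reports whether it is still open at the end. The function name probe and its parameters are made up for this illustration and are not part of Elasticsearch or Nomad. It also sends no transport traffic, so it may not trigger a device that only reacts to real Elasticsearch messages, and Elasticsearch itself might close a connection that never completes a handshake. Treat an early close as a hint, not proof.

```python
import socket
import time


def probe(host, port, hold_seconds=30.0, connect_timeout=5.0):
    """Connect to host:port, hold the socket idle, and report what happened.

    Returns one of: "still open", "closed by peer",
    "reset: <error>", or "connect failed: <error>".
    """
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError as exc:
        return f"connect failed: {exc}"

    with sock:
        deadline = time.monotonic() + hold_seconds
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return "still open"
            sock.settimeout(remaining)
            try:
                data = sock.recv(1)
            except socket.timeout:
                # Nothing arrived before the deadline: the connection survived.
                return "still open"
            except OSError as exc:
                # For example a TCP reset injected somewhere on the path.
                return f"reset: {exc}"
            if data == b"":
                # Orderly shutdown (FIN) from the other end.
                return "closed by peer"
            # The peer sent some bytes; ignore them and keep waiting.
```

Running something like probe("172.31.11.195", 9300) from inside each container, against each of the other seed hosts, would show whether a connection that nc reports as successful also survives for the hold period. If it is reset or closed well before the deadline on the failing deploys but not on the working ones, that points at something between the nodes rather than at the Elasticsearch configuration.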
