Master Nodes not communicating properly in AWS

USA-RedDragon · July 1, 2019, 6:34pm

I am using HashiCorp Nomad in AWS with Docker images based on the official ones (docker.elastic.co/elasticsearch/elasticsearch:7.2.0).

The security groups allow communication properly, as I can verify with netcat inside the container once deployed.

I'm getting a lot of NodeNotConnectedExceptions, and I'm not sure what I can do about it. Roughly one deploy in ten runs perfectly, but as soon as I redeploy it fails again with a similar log.

If I don't get NodeNotConnectedExceptions, I get CoordinationStateRejectedExceptions or instances seemingly fighting over the role of master: https://pastebin.com/pFmKrLtq (note I tried this with 7.0.0, but experienced it with 7.2.0 as well)

My Dockerfile:

ARG ELK_VERSION
FROM docker.elastic.co/elasticsearch/elasticsearch:${ELK_VERSION}
COPY config/elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml
ENV ES_JAVA_OPTS "-Xmx2g -Xms2g"
RUN echo "vm.max_map_count = 262144" > /etc/sysctl.conf

My configuration:

network:
  host: "0.0.0.0"
  bind_host: "0.0.0.0"

bootstrap.memory_lock: true

cluster:
  name: "clevyr-elk-cluster"
  initial_master_nodes:
    - elasticsearch-master-0
    - elasticsearch-master-1
    - elasticsearch-master-2

discovery:
  seed_providers: settings
  seed_hosts:
    - 172.31.29.199
    - 172.31.33.10
    - 172.31.11.195

node:
  max_local_storage_nodes: 3

node.name comes from Nomad, where NOMAD_GROUP_NAME is elasticsearch-master and NOMAD_ALLOC_INDEX ranges from 0 to 2. network.publish_host is set to the private IP of the EC2 instance:

node.name=${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_INDEX}
network.publish_host=${NOMAD_IP_rest}
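
The names Nomad generates have to match the entries under cluster.initial_master_nodes exactly, so it can be worth checking that mapping on its own. The following is only a minimal sketch: the helper names node_name and is_bootstrap_candidate are hypothetical, and the environment is passed in as a plain dict rather than read from a real Nomad allocation.

```python
# Hypothetical helpers, not part of Nomad or Elasticsearch.
# The list below is copied from cluster.initial_master_nodes in the config above.
INITIAL_MASTER_NODES = [
    "elasticsearch-master-0",
    "elasticsearch-master-1",
    "elasticsearch-master-2",
]


def node_name(env):
    """Build node.name the same way the Nomad template does."""
    return f"{env['NOMAD_GROUP_NAME']}-{env['NOMAD_ALLOC_INDEX']}"


def is_bootstrap_candidate(env, initial_master_nodes=INITIAL_MASTER_NODES):
    """True if the generated node.name is listed in initial_master_nodes."""
    return node_name(env) in initial_master_nodes


if __name__ == "__main__":
    import os

    # Inside an allocation these two variables are set by Nomad;
    # the fallbacks here are only for running the sketch by hand.
    env = {
        "NOMAD_GROUP_NAME": os.environ.get("NOMAD_GROUP_NAME", "elasticsearch-master"),
        "NOMAD_ALLOC_INDEX": os.environ.get("NOMAD_ALLOC_INDEX", "0"),
    }
    print(node_name(env), is_bootstrap_candidate(env))
```

With the values described above, all three allocations should report a name that appears in the list; an allocation index outside 0 to 2 would not.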

Log messages from a clean run with no data: https://pastebin.com/bAud1hWd

Note the complaints about 172.31.11.195:9300 not being connected. However, from this instance, inside the container, I can run:

nc -v 172.31.11.195 9300

and it connects successfully.
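
The same reachability check can be scripted against all three seed hosts at once. This is a rough sketch rather than a diagnostic tool: can_connect is a hypothetical helper, port 9300 is taken from the log complaint above, and a successful TCP handshake only shows that the port is reachable at that moment, not that the connection stays usable.

```python
import socket

# Seed hosts from discovery.seed_hosts in the config above.
SEED_HOSTS = ["172.31.29.199", "172.31.33.10", "172.31.11.195"]
TRANSPORT_PORT = 9300  # transport port seen in the log messages


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for host in SEED_HOSTS:
        status = "ok" if can_connect(host, TRANSPORT_PORT) else "FAILED"
        print(f"{host}:{TRANSPORT_PORT} {status}")
```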

DavidTurner · July 2, 2019, 5:52am

Your log, specifically the NodeNotConnectedExceptions, indicates that the nodes are able to connect to each other, but something outside of Elasticsearch is then breaking these connections after a few messages. Often this is caused by an overenthusiastic intrusion detection system (IDS).
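
A one-shot connect test such as nc cannot show this kind of failure, because the handshake itself succeeds. One rough way to look for it is to open a connection, hold it for a while, and then see whether the other side has closed or reset it. The sketch below assumes nothing about Elasticsearch's transport protocol and sends no data; connection_survives is a hypothetical helper. Note that it only holds an idle connection, whereas the drops described here happen after a few messages, and that Elasticsearch may itself close a connection that never speaks its protocol, so a False result is a hint rather than proof of outside interference.

```python
import socket
import time


def connection_survives(host, port, hold_seconds=5.0, timeout=3.0):
    """Open a TCP connection, hold it idle, then report whether it is still open."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        time.sleep(hold_seconds)
        sock.settimeout(0.2)
        try:
            # Peek so nothing is consumed; we only care about the socket state.
            data = sock.recv(1, socket.MSG_PEEK)
        except socket.timeout:
            # Nothing to read and no close seen: the connection is still open.
            return True
        except OSError:
            # Connection reset or another socket error.
            return False
        # An empty read means the peer closed the connection.
        return data != b""


if __name__ == "__main__":
    # Address taken from the log complaint in the original post.
    print(connection_survives("172.31.11.195", 9300, hold_seconds=10.0))
```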
