I am using HashiCorp Nomad in AWS with Docker images based on the official ones (docker.elastic.co/elasticsearch/elasticsearch:7.2.0).
The security groups allow communication properly, as I can verify with netcat inside the container once deployed.
I'm getting a lot of NodeNotConnectedExceptions, and I'm not sure what I can do about it. About one deploy in ten runs perfectly until I redeploy, and then it fails with a similar log again.
If I don't get NodeNotConnectedExceptions, I get CoordinationStateRejectedExceptions or instances seemingly fighting over the role of master: https://pastebin.com/pFmKrLtq (note that I tried this with 7.0.0, but I experienced it with 7.2.0 as well).
My Dockerfile:
ARG ELK_VERSION
FROM docker.elastic.co/elasticsearch/elasticsearch:${ELK_VERSION}
COPY config/elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml
ENV ES_JAVA_OPTS="-Xmx2g -Xms2g"
RUN echo "vm.max_map_count = 262144" > /etc/sysctl.conf
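I'm not certain the sysctl.conf line above has any effect at runtime, since vm.max_map_count is a kernel setting shared with the host. Here is a minimal sketch for checking the effective value from inside a running container, assuming /proc is readable there. The helper names are my own, not part of Elasticsearch or Nomad:

```python
from pathlib import Path

# Value taken from the Dockerfile above.
REQUIRED_MAX_MAP_COUNT = 262144


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Return the vm.max_map_count value the kernel currently reports."""
    return int(Path(path).read_text().strip())


def is_sufficient(value, required=REQUIRED_MAX_MAP_COUNT):
    """True when the reported value is at least the required one."""
    return value >= required
```

Running is_sufficient(read_max_map_count()) inside each container should show whether the value the process actually sees is at least 262144.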
My configuration:
network:
  host: "0.0.0.0"
  bind_host: "0.0.0.0"
bootstrap.memory_lock: true
cluster:
  name: "clevyr-elk-cluster"
  initial_master_nodes:
    - elasticsearch-master-0
    - elasticsearch-master-1
    - elasticsearch-master-2
discovery:
  seed_providers: settings
  seed_hosts:
    - 172.31.29.199
    - 172.31.33.10
    - 172.31.11.195
node:
  max_local_storage_nodes: 3
node.name comes from Nomad, where NOMAD_GROUP_NAME is elasticsearch-master and NOMAD_ALLOC_INDEX ranges from 0 to 2. network.publish_host is set to the private IP of the EC2 instance each node runs on:
node.name=${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_INDEX}
network.publish_host=${NOMAD_IP_rest}
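As far as I understand, the entries in cluster.initial_master_nodes have to match each node's node.name exactly. Here is a small sketch that rebuilds the names from the Nomad values given above and compares them with the configured list. The helper name is hypothetical, and the group name and count are the ones stated in this post:

```python
from typing import List


def expected_node_names(group_name: str, count: int) -> List[str]:
    """Mimic node.name=${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_INDEX} for indexes 0..count-1."""
    return [f"{group_name}-{index}" for index in range(count)]


# Copied from cluster.initial_master_nodes in the configuration above.
initial_master_nodes = [
    "elasticsearch-master-0",
    "elasticsearch-master-1",
    "elasticsearch-master-2",
]

names = expected_node_names("elasticsearch-master", 3)

# Names present in one list but not the other would indicate a mismatch.
mismatches = sorted(set(names) ^ set(initial_master_nodes))
print("mismatches:", mismatches)  # prints: mismatches: []
```

With the values from this post the two lists agree, so a naming mismatch does not seem to be the cause here.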
Log messages from a clean run with no data: https://pastebin.com/bAud1hWd
Note the complaints about 172.31.11.195:9300 not being connected. However, from this instance, inside the container, I can run:
nc -v 172.31.11.195 9300
and it connects successfully.
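To repeat the same check against every seed host rather than just one, a rough equivalent of the nc test can be sketched as follows. The function names are my own, and the port is assumed to be the default transport port 9300. A plain TCP connect only shows that the port is reachable; it does not prove that the Elasticsearch transport handshake succeeds or that the node advertises the address I expect:

```python
import socket
from typing import Dict, Iterable

# Copied from discovery.seed_hosts in the configuration above.
SEED_HOSTS = ["172.31.29.199", "172.31.33.10", "172.31.11.195"]
TRANSPORT_PORT = 9300  # assumption: default transport port, as in the nc test


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_seed_hosts(hosts: Iterable[str], port: int = TRANSPORT_PORT,
                     timeout: float = 3.0) -> Dict[str, bool]:
    """Map each host to whether its transport port accepted a connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling check_seed_hosts(SEED_HOSTS) from inside each of the three containers would give a full reachability matrix, which may help show whether the failures are one-directional.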