ES under Docker Swarm fails with "found existing node with the same id but is a different node instance"


(Saul) #1

Hello All,

Although I know my way around Docker to some extent, I am very new to ES.

Some environmental information:

centos-release-7-4.1708.el7.centos.x86_64
Docker version 17.12.0-ce, build c97c6d6
ES Version: 5.6.7, Build: 4669214/2018-01-25T21:14:50.776Z, JVM: 1.8.0_151

Based on what I've read on the Web, the Dockerized version of ES I am using (pulled via 'elasticsearch:5') may not be the official ES image, but one maintained by Docker.

The Docker compose yml file looks like this:

`version: "3.3"

services:
  elasticsearch:
    command: >
      elasticsearch
      -E discovery.zen.ping.unicast.hosts=elasticsearch
      -E discovery.zen.minimum_master_nodes=1
      -E node.max_local_storage_nodes=1
      -E network.host=0.0.0.0
    image: elasticsearch:5    # unofficial (from Docker) image; runs as user root
    deploy:
      mode: replicated
      endpoint_mode: dnsrr
    volumes:
      - type: volume
        source: nfs_share
        target: /usr/share/elasticsearch/data
        volume:
          nocopy: true

  nginx:
    image: 'nginx:1'
    ports:
      - target: 9200
        published: 9200
        protocol: tcp
        mode: ingress
    command: |
      /bin/bash -c "echo '
      server {
        listen 9200;
        add_header X-Frame-Options "SAMEORIGIN";
        client_max_body_size 64M;

        location / {
          proxy_pass http://elasticsearch:9200;
          proxy_http_version 1.1;
          proxy_set_header Connection keep-alive;
          proxy_set_header Upgrade $$http_upgrade;
          proxy_set_header Host $$host;
          proxy_set_header X-Real-IP $$remote_addr;
          proxy_cache_bypass $$http_upgrade;
        }
      }' | tee /etc/nginx/conf.d/default.conf && nginx -g 'daemon off;'"

volumes:
  nfs_share:
    driver: local
    driver_opts:
      type: nfs
      o: "addr=71.100.14.14,rsize=1048576,wsize=1048576,nolock,soft,rw,timeo=600,retrans=2"
      device: ":/export/zfs/saul"`

I start the stack via `docker stack deploy --compose-file ./esnginx2.yml nfsnginx2`.

nginx (under service nfsnginx2_nginx) comes up on one of the 2 CentOS nodes; ES (under service nfsnginx2_elasticsearch) on the other. So far so good.

But when I issue `docker service scale nfsnginx2_elasticsearch=2`, the 2nd ES node is started but it never joins the ES cluster. Its log shows:

[2018-02-19T16:08:38,198][INFO ][o.e.d.z.ZenDiscovery ] [fb5lofA] failed to send join request to master [{fb5lofA}{fb5lofA-RPqtftVfhoKeiA}{JdGOhu_PRYe0m1FgqMw3Zw}{10.0.0.4}{10.0.0.4:9300}], reason [RemoteTransportException[[fb5lofA][10.0.0.4:9300][internal:discovery/zen/join]]; nested: IllegalArgumentException[can't add node {fb5lofA}{fb5lofA-RPqtftVfhoKeiA}{zn7-jwGiSQef6_GcejrvQw}{10.0.0.17}{10.0.0.17:9300}, found existing node {fb5lofA}{fb5lofA-RPqtftVfhoKeiA}{JdGOhu_PRYe0m1FgqMw3Zw}{10.0.0.4}{10.0.0.4:9300} with the same id but is a different node instance]; ]
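To make the clash in that message easier to see, here is a small sketch that splits the two node descriptors into fields. It assumes each descriptor reads `{name}{node id}{ephemeral id}{host}{address}`, which is my reading of the log rather than something I have confirmed; the function name `describe_nodes` is my own.

```shell
#!/usr/bin/env bash
# Excerpt of the join-failure message from the log above.
excerpt="can't add node {fb5lofA}{fb5lofA-RPqtftVfhoKeiA}{zn7-jwGiSQef6_GcejrvQw}{10.0.0.17}{10.0.0.17:9300}, found existing node {fb5lofA}{fb5lofA-RPqtftVfhoKeiA}{JdGOhu_PRYe0m1FgqMw3Zw}{10.0.0.4}{10.0.0.4:9300} with the same id but is a different node instance"

# Read log text on stdin and print one summary line per node descriptor.
# Assumption: a descriptor is five consecutive {...} groups in the order
# name, node id, ephemeral id, host, transport address.
describe_nodes() {
  grep -oE '(\{[^}]*\}){5}' \
    | sed -e 's/^{//' -e 's/}$//' -e 's/}{/ /g' \
    | while read -r name node_id ephemeral_id host address; do
        printf 'node_id=%s ephemeral_id=%s address=%s\n' \
          "$node_id" "$ephemeral_id" "$address"
      done
}

printf '%s\n' "$excerpt" | describe_nodes
```

Under that assumption, both descriptors carry the same node id (`fb5lofA-RPqtftVfhoKeiA`) while the third field and the address differ, which matches the wording "same id but is a different node instance".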

Some cursory reading on the Web seemed to indicate that an ES "node UUID" is stored in the "ES data folder". Some folks said they got it to work by deleting the contents of the data folder, e.g., /var/lib/elasticsearch/nodes/0, and then restarting ES.

This doesn't seem like an especially clean solution; nor did it work for me. So I am wondering:

  1. What is the actual cause of this error?
  2. Is it possible that the ES image (again, perhaps not from Elastic itself) contains a directory with a node UUID and that this is getting duplicated by the scale up and thus causing the error?
  3. What are possible approaches to resolve this?
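Related to question 2: both replicas mount the same `nfs_share` volume at /usr/share/elasticsearch/data, so if the ID really is kept in the data folder, they may be reading it from a shared folder rather than from the image. A minimal sketch for checking what each container sees there, assuming the `nodes/<n>` layout mentioned above (the helper name `list_node_dirs` is my own):

```shell
#!/usr/bin/env bash
# Print the names of the per-node folders found under <data_dir>/nodes.
# Assumption: ES keeps one folder per local node there (e.g. nodes/0).
list_node_dirs() {
  local data_dir="$1"
  local d
  for d in "$data_dir"/nodes/*/; do
    # When nothing matches, the unexpanded pattern fails the -d test.
    if [ -d "$d" ]; then
      basename "$d"
    fi
  done
  return 0
}

# Intended to be run inside each ES container against the mounted data path.
list_node_dirs /usr/share/elasticsearch/data
```

If both containers report the same single folder (for instance `0`), that would suggest they are sharing one node folder on the NFS volume; it would not by itself prove that this is the cause of the error.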

I would be most grateful for some help with this.

Thanks.

-Saul

