Two of the twelve nodes not joining the cluster


#1

Hello,

I have set up a cluster of twelve nodes (node 2 is the master, the others are data nodes), but two nodes are missing. I googled a lot before coming here but cannot find the fix I need. The content of elasticsearch.yml is identical to the file on the other nodes, so I don't know why they are not joining. This is what is inside the file:

   cluster.name: datascience-lab    
   node.name: ds-es-4
   node.data: true
   network.host: 0.0.0.0
   http.port: 9200
   discovery.zen.ping.unicast.hosts: ["10.230.7.2:9300", "10.230.7.3:9300", "10.230.7.4:9300"]
   discovery.zen.minimum_master_nodes: 7

The nodes can ping each other, and when I run curl localhost:9200 on the missing nodes they answer:

{
  "name" : "ds-es-4",
  "cluster_name" : "datascience-lab",
  "version" : {
    "number" : "2.4.0",
    "build_hash" : "ce9f0c7394dee074091dd1bc4e9469251181fc55",
    "build_timestamp" : "2016-08-29T09:14:17Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

But when I check the cluster health on the missing nodes, I get this after 10 seconds:

{
  "error" : {
    "root_cause" : [ {
      "type" : "master_not_discovered_exception",
      "reason" : null
    } ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Any ideas what to try?


(Mark O Stewart) #2

Did they ever join?
If not, it may be a firewall or switch routing issue.

If yes, have you tried netstat to see if the nodes are listening on the transport port?
A curl to localhost goes over HTTP and can get an answer from a single node, but the nodes need the transport port open to communicate with each other.
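
If netstat is not handy, the same check can be sketched in a few lines of Python. This is only a rough illustration: `can_connect` is a made-up helper name, and it assumes the default transport port 9300 shown in the unicast hosts list.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from each node against the others, for example `can_connect("10.230.7.4", 9300)` from the master. A False result only says the TCP connection failed; it does not say whether the cause is a firewall, routing, or the node not listening.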

Hope this helps some.


#3

Thanks for your reply!

Good question, I can't recall if they have ever successfully joined...
I tried netstat and I can't see the missing nodes listening, whilst I can see the others.

I'll have to visit the network guys to see if they can do anything about it.


(Mark O Stewart) #4

If the nodes aren't listening, it is a local config problem. The network guys can open firewall ports and configure switches, but if Elasticsearch is not listening on the transport port it still won't work.

Ensure there is no stray whitespace or other misconfiguration in elasticsearch.yml.
Maybe copy the elasticsearch.yml from one node to another and diff the files to ensure that they are the same around the transport and unicast lines.
Look closely, as I have had similar hard-to-spot issues. I think I had a whitespace at the beginning of the transport line one time and it broke the transport port config.
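
As a rough sketch of that comparison, something like the following makes leading and trailing whitespace visible by diffing the `repr()` of each line. The function name and the `node_a`/`node_b` labels are made up for illustration; you would read the two elasticsearch.yml files into strings yourself.

```python
import difflib


def whitespace_diff(text_a, text_b):
    """Return unified-diff lines with whitespace made visible via repr()."""
    # repr() quotes each line, so a leading or trailing space shows up
    # inside the quotes instead of being invisible in the terminal.
    a = [repr(line) for line in text_a.splitlines()]
    b = [repr(line) for line in text_b.splitlines()]
    return list(difflib.unified_diff(a, b, "node_a", "node_b", lineterm=""))
```

An empty list means the two files match line for line; otherwise the `-` and `+` entries show the quoted lines that differ.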


#5

I've done the copy/paste thing, but it did not work. I also re-installed Elasticsearch, but the nodes are still not showing up. What could be wrong when it says no route to host? This is the log on the missing node:


[2016-10-20 08:51:34,883][INFO ][node                     ] [ds-es-4] version[2.4.0], pid[31099], build[ce9f0c7/2016-08-29T09:14:17Z]
[2016-10-20 08:51:34,885][INFO ][node                     ] [ds-es-4] initializing ...
[2016-10-20 08:51:35,461][INFO ][plugins                  ] [ds-es-4] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2016-10-20 08:51:35,483][INFO ][env                      ] [ds-es-4] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [722.9gb], net total_space [725.1gb], spins? [unknown], types [rootfs]
[2016-10-20 08:51:35,483][INFO ][env                      ] [ds-es-4] heap size [989.8mb], compressed ordinary object pointers [true]
[2016-10-20 08:51:37,241][INFO ][node                     ] [ds-es-4] initialized
[2016-10-20 08:51:37,241][INFO ][node                     ] [ds-es-4] starting ...
[2016-10-20 08:51:37,308][INFO ][transport                ] [ds-es-4] publish_address {10.230.7.4:9300}, bound_addresses {[::]:9300}
[2016-10-20 08:51:37,314][INFO ][discovery                ] [ds-es-4] datascience-lab/UmtO_rbkTbWa-tffwThREA
[2016-10-20 08:51:40,354][WARN ][transport                ] [ds-es-4] Transport response handler not found of id [9]
[2016-10-20 08:51:40,356][INFO ][discovery.zen            ] [ds-es-4] failed to send join request to master [{ds-es-2}{EWeLqildQEO3z3hZRlJ-Rg}{10.230.7.2}{10.230.7.2:9300}], reason [RemoteTransportException[[ds-es-2][10.230.7.2:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[ds-es-4][10.230.7.4:9300] connect_timeout[30s]]; nested: NotSerializableExceptionWrapper[no_route_to_host_exception: No route to host]; ]
[2016-10-20 08:51:43,367][WARN ][transport                ] [ds-es-4] Transport response handler not found of id [19]
[2016-10-20 08:51:43,368][INFO ][discovery.zen            ] [ds-es-4] failed to send join request to master [{ds-es-2}{EWeLqildQEO3z3hZRlJ-Rg}{10.230.7.2}{10.230.7.2:9300}], reason [RemoteTransportException[[ds-es-2][10.230.7.2:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[ds-es-4][10.230.7.4:9300] connect_timeout[30s]]; nested: NotSerializableExceptionWrapper[no_route_to_host_exception: No route to host]; ]
[2016-10-20 08:51:46,375][WARN ][transport                ] [ds-es-4] Transport response handler not found of id [29]
[2016-10-20 08:51:46,379][INFO ][discovery.zen            ] [ds-es-4] failed to send join request to master [{ds-es-2}{EWeLqildQEO3z3hZRlJ-Rg}{10.230.7.2}{10.230.7.2:9300}], reason [RemoteTransportException[[ds-es-2][10.230.7.2:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[ds-es-4][10.230.7.4:9300] connect_timeout[30s]]; nested: NotSerializableExceptionWrapper[no_route_to_host_exception: No route to host]; ]
[2016-10-20 08:51:49,387][WARN ][transport                ] [ds-es-4] Transport response handler not found of id [39]
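
One way to read the nested exceptions in the log above, offered as an interpretation rather than a confirmed diagnosis: the RemoteTransportException names the node that reported the failure, and the nested ConnectTransportException names the node and address it could not reach. A small sketch that pulls both out of a log line (the function name and regular expressions are my own, not anything shipped with Elasticsearch):

```python
import re

# Both exception types print "[[node][address]..." right after their name.
_REMOTE = re.compile(r"RemoteTransportException\[\[([^\]]+)\]\[([^\]]+)\]")
_CONNECT = re.compile(r"ConnectTransportException\[\[([^\]]+)\]\[([^\]]+)\]")


def failed_connection(log_line):
    """Return (reporting_node, unreachable_node, unreachable_address) or None."""
    remote = _REMOTE.search(log_line)
    connect = _CONNECT.search(log_line)
    if remote is None or connect is None:
        return None
    return (remote.group(1), connect.group(1), connect.group(2))
```

For the join failures above this gives `('ds-es-2', 'ds-es-4', '10.230.7.4:9300')`, which would suggest that ds-es-2 cannot open a connection back to ds-es-4 on port 9300, rather than the other way around.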

(David Pilato) #6

Update to 2.4.1

