ElasticSearch 5.4 Nodes unable to join cluster - Troubleshooting

Hi There,

We are running into some issues post upgrade from 2.3.x to 5.4.

OS : RHEL 7
Java : JDK1.8.0_111

Server (VM) #1:

node.name: node-1
network.host: xxx.xxx.197.14
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #2:

node.name: node-2
network.host: xxx.xxx.197.15
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #3:

node.name: node-3
network.host: xxx.xxx.197.16
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #4:

node.name: node-4
network.host: xxx.xxx.197.17
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #5:

node.name: node-5
network.host: xxx.xxx.197.18
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

What would be the correct startup order for these 5 nodes so that they form the cluster dsinke3?

I am running into these issues:

  1. org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
  2. org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.15:9200] handshake_timeout[1.6m]
  3. [o.e.d.z.UnicastZenPing ] [node-3] [6] failed to ping

Please help.

Thanks in advance, dp

As you have 5 master-eligible nodes, minimum_master_nodes should be set to 3, not 2. With the current configuration you could end up with a split cluster (split brain).
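For reference, the value follows the usual quorum formula: a majority of the master-eligible nodes, i.e. floor(N / 2) + 1. A sketch of the corrected setting:

```yaml
# Quorum = floor(master_eligible_nodes / 2) + 1
# With 5 master-eligible nodes: floor(5 / 2) + 1 = 3
discovery.zen.minimum_master_nodes: 3
```

With 3 as the quorum, no two disjoint groups of the 5 nodes can both elect a master at the same time.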

Hi Christian,

Thanks for response.

I updated discovery.zen.minimum_master_nodes to 3 and reduced the timeouts to 10s. I am still getting the error below.

[node-4] [13] failed to ping {#zen_unicast_xxx.xxx.197.18:9200_0#}{z5D73ZMaQbSqgfJcrq9t3A}{xxx.xxx.197.18}{xxx.xxx.197.18:9200}
org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.18:9200] handshake_timeout[10s]

I also see the exception below in the logs:

[2017-05-20T19:55:35,600][WARN ][o.e.t.n.Netty4Transport ] [node-5] exception caught on transport layer [[id: 0x9445bdb5, L:/10.156.197.18:34094 - R:/10.156.197.15:9200]], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (48,54,54,50)

The log statement below repeats every few seconds:
[o.e.d.z.ZenDiscovery ] [node-3] not enough master nodes discovered during pinging (found [[Candidate{node={node-3}{KtGoZr7YSmeTuMHTMquQJQ}{nso4kJ5qSzag5zE2sRZ3Sw}{xxx.xxx.197.16}{xxx.xxx.197.16:9300}, clusterStateVersion=-1}]], but needed [3]), pinging again

Telnet to ports 9200 and 9300 both succeeds.

The unicast port should be 9300, not 9200, as 9200 is the HTTP port; node-to-node transport uses 9300. (The bytes (48,54,54,50) in the StreamCorruptedException are the hex codes for the ASCII string "HTTP", which shows the transport layer received an HTTP request.)


Hi Christian,

I removed :9200 from discovery.zen.ping.unicast.hosts and it started working.

Thanks for your help.
-dp
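For anyone hitting the same problem, the corrected discovery section would look roughly like this (IPs elided as in the posts above; the port suffix can also be omitted entirely, since unicast discovery defaults to the transport port 9300):

```yaml
# elasticsearch.yml – discovery settings after the fix
cluster.name: dsinke3
# Use the transport port 9300 for unicast discovery;
# 9200 is the HTTP port and must not be used here.
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9300", "xxx.xxx.197.15:9300", "xxx.xxx.197.16:9300", "xxx.xxx.197.17:9300", "xxx.xxx.197.18:9300"]
# Majority of 5 master-eligible nodes: floor(5/2) + 1 = 3
discovery.zen.minimum_master_nodes: 3
```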

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.