Elasticsearch 5.4 nodes unable to join cluster - Troubleshooting

techpanga · May 20, 2017, 6:42pm

Hi There,

We are running into issues after upgrading from 2.3.x to 5.4.

OS : RHEL 7
Java : JDK1.8.0_111

Server (VM) #1:

node.name: node-1
network.host: xxx.xxx.197.14
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #2:

node.name: node-2
network.host: xxx.xxx.197.15
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #3:

node.name: node-3
network.host: xxx.xxx.197.16
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #4:

node.name: node-4
network.host: xxx.xxx.197.17
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #5:

node.name: node-5
network.host: xxx.xxx.197.18
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

What is the right startup order for these 5 nodes so that they all join the cluster dsinke3?

I am running into these issues...

org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.15:9200] handshake_timeout[1.6m]
[o.e.d.z.UnicastZenPing ] [node-3] [6] failed to ping

Please help.

Thanks in advance, dp

Christian_Dahlqvist · May 20, 2017, 7:45pm

As you have 5 master-eligible nodes, minimum_master_nodes should be set to 3 (a majority of 5), not 2. With the current configuration, two groups of nodes could each elect their own master and you could end up with a split cluster.
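
The majority rule behind that number can be sketched in a few lines of Python. This is only an illustration of the usual quorum formula, floor(n / 2) + 1; the function name `minimum_master_nodes` is a hypothetical helper, not part of any Elasticsearch API.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the majority quorum, floor(n / 2) + 1, for n master-eligible nodes.

    Hypothetical helper for illustration only.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# The cluster in this thread has 5 master-eligible nodes.
print(minimum_master_nodes(5))  # 3
```

With a quorum of 3, a group of 2 isolated nodes can never elect a master, so at most one side of a network partition stays active.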

techpanga · May 20, 2017, 8:02pm

Hi Christian,

Thanks for the response.

I updated discovery.zen.minimum_master_nodes to 3 and reduced the timeouts to 10s. I am now getting the error below.

[node-4] [13] failed to ping {#zen_unicast_xxx.xxx.197.18:9200_0#}{z5D73ZMaQbSqgfJcrq9t3A}{xxx.xxx.197.18}{xxx.xxx.197.18:9200}
org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.18:9200] handshake_timeout[10s]

I also see the below exception in logs...

[2017-05-20T19:55:35,600][WARN ][o.e.t.n.Netty4Transport ] [node-5] exception caught on transport layer [[id: 0x9445bdb5, L:/10.156.197.18:34094 - R:/10.156.197.15:9200]], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (48,54,54,50)

The log statement below repeats every few seconds:
[o.e.d.z.ZenDiscovery ] [node-3] not enough master nodes discovered during pinging (found [[Candidate{node={node-3}{KtGoZr7YSmeTuMHTMquQJQ}{nso4kJ5qSzag5zE2sRZ3Sw}{xxx.xxx.197.16}{xxx.xxx.197.16:9300}, clusterStateVersion=-1}]], but needed [3]), pinging again

Telnet connections to ports 9200 and 9300 both succeed.

Christian_Dahlqvist · May 20, 2017, 8:22pm

The unicast port should be 9300 (the transport port), not 9200, which is the HTTP port.
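
The StreamCorruptedException quoted earlier in the thread points the same way. As a rough check, and assuming the four values in "got (48,54,54,50)" are hexadecimal byte values, they can be decoded as ASCII. The helper name `decode_header_bytes` is made up for this sketch.

```python
def decode_header_bytes(reported: str) -> str:
    """Decode a comma-separated list of hex byte values into ASCII text.

    Assumes the log prints each byte in hexadecimal (an assumption,
    not something stated in the thread).
    """
    return bytes(int(part, 16) for part in reported.split(",")).decode("ascii")


# Values copied from the exception message in the thread.
print(decode_header_bytes("48,54,54,50"))  # HTTP
```

Under that assumption the bytes spell "HTTP", which suggests the transport client connected to the HTTP listener on 9200 and received an HTTP response instead of a transport message.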

techpanga · May 20, 2017, 8:36pm

Hi Christian,

I removed :9200 from the entries in discovery.zen.ping.unicast.hosts, and it started working.

Thanks for your help.
-dp
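
For anyone applying the same fix across several nodes, the change can be sketched as below. The `strip_port` helper is hypothetical, and the sketch assumes that an entry without a port falls back to the transport port (9300 by default, which matches the transport address 9300 shown in the ZenDiscovery log line above).

```python
def strip_port(host: str, port: int = 9200) -> str:
    """Remove a trailing ':<port>' from a host entry, if present.

    Hypothetical helper for illustration; other entries are returned unchanged.
    """
    suffix = f":{port}"
    return host[: -len(suffix)] if host.endswith(suffix) else host


# Host list as posted in the thread (addresses are masked in the original).
hosts = [
    "xxx.xxx.197.14:9200",
    "xxx.xxx.197.15:9200",
    "xxx.xxx.197.16:9200",
    "xxx.xxx.197.17:9200",
    "xxx.xxx.197.18:9200",
]

fixed = [strip_port(h) for h in hosts]
print(fixed)
```

The resulting list is what would go into discovery.zen.ping.unicast.hosts on every node. Writing ":9300" explicitly on each entry should be an equivalent alternative.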

