ElasticSearch 5.4 Nodes unable to join cluster - Troubleshooting

Hi There,

We are running into some issues post upgrade from 2.3.x to 5.4.

OS : RHEL 7
Java : JDK1.8.0_111

Server (VM) #1:

node.name: node-1
network.host: xxx.xxx.197.14
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #2:

node.name: node-2
network.host: xxx.xxx.197.15
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #3:

node.name: node-3
network.host: xxx.xxx.197.16
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #4:

node.name: node-4
network.host: xxx.xxx.197.17
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

Server (VM) #5:

node.name: node-5
network.host: xxx.xxx.197.18
cluster.name: dsinke3
node.master: true
node.data: true
path.data: /elkstore/elasticsearch
path.logs: /elkstore/logs
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9200", "xxx.xxx.197.15:9200", "xxx.xxx.197.16:9200", "xxx.xxx.197.17:9200", "xxx.xxx.197.18:9200"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 100s
discovery.zen.fd.ping_timeout: 100s
http.cors.enabled: true
http.cors.allow-origin: "*"

What would be the correct startup order for these 5 nodes so that they form the cluster dsinke3?

I am running into these issues:

  1. org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
  2. org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.15:9200] handshake_timeout[1.6m]
  3. [o.e.d.z.UnicastZenPing ] [node-3] [6] failed to ping

Please help.

Thanks in advance, dp

As you have 5 master-eligible nodes, minimum_master_nodes should be set to 3, not 2. With the current configuration you could end up with a split cluster (split brain).
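For reference, the value follows the usual quorum formula: a majority of the master-eligible nodes, i.e. floor(N / 2) + 1. A sketch of the corrected setting:

```yaml
# Quorum = floor(master_eligible_nodes / 2) + 1
# With 5 master-eligible nodes: floor(5 / 2) + 1 = 3
discovery.zen.minimum_master_nodes: 3
```

With 3 as the quorum, no two disjoint groups of the 5 nodes can both elect a master at the same time.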

Hi Christian,

Thanks for response.

I updated discovery.zen.minimum_master_nodes to 3 and reduced the timeouts to 10s. I am still getting the error below.

[node-4] [13] failed to ping {#zen_unicast_xxx.xxx.197.18:9200_0#}{z5D73ZMaQbSqgfJcrq9t3A}{xxx.xxx.197.18}{xxx.xxx.197.18:9200}
org.elasticsearch.transport.ConnectTransportException: [][xxx.xxx.197.18:9200] handshake_timeout[10s]

I also see the exception below in the logs:

[2017-05-20T19:55:35,600][WARN ][o.e.t.n.Netty4Transport ] [node-5] exception caught on transport layer [[id: 0x9445bdb5, L:/10.156.197.18:34094 - R:/10.156.197.15:9200]], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (48,54,54,50)

The log statement below repeats every few seconds:
[o.e.d.z.ZenDiscovery ] [node-3] not enough master nodes discovered during pinging (found [[Candidate{node={node-3}{KtGoZr7YSmeTuMHTMquQJQ}{nso4kJ5qSzag5zE2sRZ3Sw}{xxx.xxx.197.16}{xxx.xxx.197.16:9300}, clusterStateVersion=-1}]], but needed [3]), pinging again

Telnet to ports 9200 and 9300 both succeeds.

The unicast port should be 9300, not 9200, as 9200 is the HTTP port; node-to-node transport uses 9300. (The bytes (48,54,54,50) in the StreamCorruptedException are the hex codes for the ASCII string "HTTP", which shows the transport layer received an HTTP request.)


Hi Christian,

I removed :9200 from discovery.zen.ping.unicast.hosts and it started working.

Thanks for your help.
-dp
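For anyone hitting the same problem, the corrected discovery section would look roughly like this (IPs elided as in the posts above; the port suffix can also be omitted entirely, since unicast discovery defaults to the transport port 9300):

```yaml
# elasticsearch.yml – discovery settings after the fix
cluster.name: dsinke3
# Use the transport port 9300 for unicast discovery;
# 9200 is the HTTP port and must not be used here.
discovery.zen.ping.unicast.hosts: ["xxx.xxx.197.14:9300", "xxx.xxx.197.15:9300", "xxx.xxx.197.16:9300", "xxx.xxx.197.17:9300", "xxx.xxx.197.18:9300"]
# Majority of 5 master-eligible nodes: floor(5/2) + 1 = 3
discovery.zen.minimum_master_nodes: 3
```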

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.