ES failed to connect to master node

Hi team,

I have a 3-node Elasticsearch cluster, but at one point Elasticsearch stopped functioning properly because the master node left (reason = shut_down):

```
[2023-09-13T01:14:28,394][INFO ][o.e.n.Node ] [IDNDCI-VSCSPSM1] started
[2023-09-13T01:20:09,510][INFO ][o.e.d.z.ZenDiscovery ] [IDNDCI-VSCSPSM1] master_left [{IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300}], reason [shut_down]
[2023-09-13T01:20:09,510][WARN ][o.e.d.z.ZenDiscovery ] [IDNDCI-VSCSPSM1] master left (reason = shut_down), current nodes: nodes:
{IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300}, master
{IDNDCI-VSCSPSM1}{Qi8DZwLOSCCGb_hCYK0ayQ}{eU-NYiPlRhKM8ezq6fxEYQ}{IDNDCI-VSCSPSM1}{10.162.40.18:9300}, local
{IDNDCI-VSCSPGR7}{Ut_237WpSnO9mpaB_ACDbw}{Vbu7dLhwTuWhX0nPajW1OQ}{IDNDCI-VSCSPGR7}{10.162.40.49:9300}

[2023-09-13T01:20:09,525][WARN ][o.e.t.n.Netty4Transport ] [IDNDCI-VSCSPSM1] write and flush on the network layer failed (channel: [id: 0x745ba491, L:0.0.0.0/0.0.0.0:9300 ! R:/10.162.40.47:62087])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2023-09-13T01:20:10,544][WARN ][o.e.c.NodeConnectionsService] [IDNDCI-VSCSPSM1] failed to connect to node {IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [IDNDCI-VSCSPGR5][10.162.40.47:9300] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.16.jar:5.6.16]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_271]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_271]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: IDNDCI-VSCSPGR5/10.162.40.47:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
... 1 more
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:?]
at
```

This is the relevant config in elasticsearch.yml:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1
  - IDNDCI-VSCSPGR5
  - IDNDCI-VSCSPGR7
http.port: 9200
```

After I changed elasticsearch.yml like this:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
```

the nodes could communicate with the master node again (see the quick check below).
I need to find the root cause of this issue:
- the 9300 port is open
- those 3 hosts are in the same network segment
- no ports are blocked by firewalls
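
For reference, a quick way to confirm that all three nodes have rejoined and agree on the same master (using one of the hostnames from this setup; any of the three works, assuming HTTP is reachable on port 9200):

```
# list all nodes currently in the cluster; the elected master is marked with *
curl -s "http://IDNDCI-VSCSPSM1:9200/_cat/nodes?v"

# overall cluster health; number_of_nodes should be 3
curl -s "http://IDNDCI-VSCSPSM1:9200/_cluster/health?pretty"
```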

Thanks and regards,
Adira

It looks like you are running a very old version of Elasticsearch that has been EOL for a very long time. I recommend that you upgrade to the latest version.

In order to have a highly available and reliable cluster on this version, it is vital to have discovery.zen.minimum_master_nodes set correctly. If you have 3 master-eligible nodes, this parameter must be set to 2 on all nodes.
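
For a 3-node cluster where every node is master eligible, the quorum works out to floor(3 / 2) + 1 = 2, so each node's elasticsearch.yml needs a line like this (a sketch; the rest of the file stays as you have it):

```
# quorum of master-eligible nodes: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```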

Please also show the full configuration for all nodes, formatted correctly using the tools available here.


Hi, this is the configuration of elasticsearch.yml on every node:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPSM1
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPSM1
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPGR5
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPGR5
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPGR7
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPGR7
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

Does this issue occur because this version of Elasticsearch has a bug or something else that caused the connection to be refused between nodes?

You do not have the setting I linked to set correctly, which means your cluster is misconfigured. You first need to fix this.

I have not used this old version for years, and not on Windows, so if adding the setting does not resolve the issue I suspect I will not be able to help much.

I think there is a misunderstanding here: this issue has already been resolved by adding the 9300 port to discovery.zen.ping.unicast.hosts, and after that the Elasticsearch nodes could connect to each other again.

I posted this because I want to find the root cause of why this issue occurred.

Your cluster is still misconfigured, which can lead to split-brain scenarios and data loss. It can also result in nodes not connecting and instead forming separate clusters.

I believe this is the root cause, but why did this "misconfigured" config work from the first deployment (2 years ago) and then suddenly cause this issue?

It could be the root cause, but there could be other factors at play as well. The fact that the cluster is not correctly configured does make it difficult to identify any other potential issues, though. This misconfiguration does not make clusters fail immediately, and clusters can run fine for long periods without issues, as only specific network/failure conditions trigger split brains. I would recommend reading the docs I linked to for more details.

This setting was often set incorrectly in earlier versions of Elasticsearch, leading to different kinds of issues. This is why this area was reworked in Elasticsearch 7.0, where the setting was removed as part of an effort to increase resilience.
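
For illustration only, this is roughly what the equivalent discovery settings look like in Elasticsearch 7.x and later, where minimum_master_nodes no longer exists (node names taken from this thread; check the docs for your exact version):

```
# 7.x+: seed hosts replace discovery.zen.ping.unicast.hosts
discovery.seed_hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300

# only needed the very first time a brand-new cluster is bootstrapped
cluster.initial_master_nodes:
  - IDNDCI-VSCSPSM1
  - IDNDCI-VSCSPGR5
  - IDNDCI-VSCSPGR7
```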

Great, thank you Christian for your explanation. I will read the docs you linked earlier.

Best Regards,
Bagus

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.