ES failed to connect to master node

Hi team,

I have a 3-node Elasticsearch cluster, but at one point Elasticsearch stopped functioning properly because the master node left (reason = shut_down):

```
[2023-09-13T01:14:28,394][INFO ][o.e.n.Node ] [IDNDCI-VSCSPSM1] started
[2023-09-13T01:20:09,510][INFO ][o.e.d.z.ZenDiscovery ] [IDNDCI-VSCSPSM1] master_left [{IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300}], reason [shut_down]
[2023-09-13T01:20:09,510][WARN ][o.e.d.z.ZenDiscovery ] [IDNDCI-VSCSPSM1] master left (reason = shut_down), current nodes: nodes:
{IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300}, master
{IDNDCI-VSCSPSM1}{Qi8DZwLOSCCGb_hCYK0ayQ}{eU-NYiPlRhKM8ezq6fxEYQ}{IDNDCI-VSCSPSM1}{10.162.40.18:9300}, local
{IDNDCI-VSCSPGR7}{Ut_237WpSnO9mpaB_ACDbw}{Vbu7dLhwTuWhX0nPajW1OQ}{IDNDCI-VSCSPGR7}{10.162.40.49:9300}

[2023-09-13T01:20:09,525][WARN ][o.e.t.n.Netty4Transport ] [IDNDCI-VSCSPSM1] write and flush on the network layer failed (channel: [id: 0x745ba491, L:0.0.0.0/0.0.0.0:9300 ! R:/10.162.40.47:62087])
java.nio.channels.ClosedChannelException: null
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2023-09-13T01:20:10,544][WARN ][o.e.c.NodeConnectionsService] [IDNDCI-VSCSPSM1] failed to connect to node {IDNDCI-VSCSPGR5}{0IS_JsPsTkK7wxlKZmYcGA}{6h50Ysw_QlS6xiXp_7W1ag}{IDNDCI-VSCSPGR5}{10.162.40.47:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [IDNDCI-VSCSPGR5][10.162.40.47:9300] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:363) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:570) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:342) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:675) [elasticsearch-5.6.16.jar:5.6.16]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.16.jar:5.6.16]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_271]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_271]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_271]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: IDNDCI-VSCSPGR5/10.162.40.47:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
... 1 more
Caused by: java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source) ~[?:?]
at
```

This is the relevant config in elasticsearch.yml:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1
  - IDNDCI-VSCSPGR5
  - IDNDCI-VSCSPGR7
http.port: 9200
```

After I changed elasticsearch.yml like this:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
```

the nodes could communicate with the master node again (see the quick check below).
I need to find the root cause of this issue:
- the 9300 port is open
- those 3 hosts are in the same network segment
- no ports are blocked by firewalls
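
For reference, a quick way to confirm that all three nodes have rejoined and agree on the same master (using one of the hostnames from this setup; any of the three works, assuming HTTP is reachable on port 9200):

```
# list all nodes currently in the cluster; the elected master is marked with *
curl -s "http://IDNDCI-VSCSPSM1:9200/_cat/nodes?v"

# overall cluster health; number_of_nodes should be 3
curl -s "http://IDNDCI-VSCSPSM1:9200/_cluster/health?pretty"
```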

Thanks and regards,
Adira

It looks like you are running a very old version of Elasticsearch that has been EOL for a very long time. I recommend that you upgrade to the latest version.

In order to have a highly available and reliable cluster on this version, it is vital to have discovery.zen.minimum_master_nodes set correctly. If you have 3 master-eligible nodes, this parameter must be set to 2 on all nodes.
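
For a 3-node cluster where every node is master eligible, the quorum works out to floor(3 / 2) + 1 = 2, so each node's elasticsearch.yml needs a line like this (a sketch; the rest of the file stays as you have it):

```
# quorum of master-eligible nodes: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```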

Please also show the full configuration for all nodes, formatted correctly using the tools available here.


Hi, this is the configuration of elasticsearch.yml on every node:

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPSM1
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPSM1
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPGR5
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPGR5
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

```
bootstrap.memory_lock: true
cluster.name: prd_sm_es_cluster
discovery.zen.ping.unicast.hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300
http.port: 9200
network.host: IDNDCI-VSCSPGR7
node.data: true
node.ingest: false
node.master: true
node.max_local_storage_nodes: 1
node.name: IDNDCI-VSCSPGR7
path.data: E:\elasticsearch_data
path.logs: E:\gcti_logs\sm_elasticsearch
transport.tcp.port: 9300
```

Does this issue occur because this version of Elasticsearch has a bug or something else that caused the connection to be refused between nodes?

You do not have the setting I linked to set correctly, which means your cluster is misconfigured. You first need to fix this.

I have not used this old version for years, and not on Windows, so if adding the setting does not resolve the issue I suspect I will not be able to help much.

I think there is a misunderstanding here: this issue has already been resolved by adding the 9300 port to discovery.zen.ping.unicast.hosts, and after that the Elasticsearch nodes could connect to each other again.

I posted this because I want to find the root cause of why this issue occurred.

Your cluster is still misconfigured, which can lead to split-brain scenarios and data loss. It can also result in nodes not connecting and instead forming separate clusters.

I believe this is the root cause, but why did this "misconfigured" config work from the first deployment (2 years ago) and then suddenly cause this issue?

It could be the root cause, but there could be other factors at play as well. The fact that the cluster is not correctly configured does make it difficult to identify any other potential issues, though. This misconfiguration does not make clusters fail immediately, and clusters can run fine for long periods without issues, as only specific network/failure conditions trigger split brains. I would recommend reading the docs I linked to for more details.

This setting was often set incorrectly in earlier versions of Elasticsearch, leading to different kinds of issues. This is why this area was reworked in Elasticsearch 7.0, where the setting was removed as part of an effort to increase resilience.
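
For illustration only, this is roughly what the equivalent discovery settings look like in Elasticsearch 7.x and later, where minimum_master_nodes no longer exists (node names taken from this thread; check the docs for your exact version):

```
# 7.x+: seed hosts replace discovery.zen.ping.unicast.hosts
discovery.seed_hosts:
  - IDNDCI-VSCSPSM1:9300
  - IDNDCI-VSCSPGR5:9300
  - IDNDCI-VSCSPGR7:9300

# only needed the very first time a brand-new cluster is bootstrapped
cluster.initial_master_nodes:
  - IDNDCI-VSCSPSM1
  - IDNDCI-VSCSPGR5
  - IDNDCI-VSCSPGR7
```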

Great, thank you Christian for your explanation. I will read the docs you linked earlier.

Best Regards,
Bagus

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.