Master_not_discovered_exception

Hello everyone,

I created an Elasticsearch environment with 4 nodes: 2 nodes have the master role and the other 2 have the data role. When all nodes are started, everything is okay, but when either master node goes down my cluster stops working. The other master node is not elected as master.

My cluster configuration

VPSYSMNGELK01(10.30.40.30)
path.data: /usr/share/Elasticsearch/data
node.roles: [ master]
network.host: 0.0.0.0
#transport.host: 0.0.0.0
bootstrap.memory_lock: true
cluster.name: MyCluster
node.name: "VPSYSMNGELK01.test.com"
discovery.seed_hosts: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com", "VPSYSMNGELK03.test.com", "VPSYSMNGELK04.test.com"]
cluster.initial_master_nodes: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com"]
xpack.security.enabled: false

VPSYSMNGELK02(10.30.40.31)
path.data: /usr/share/Elasticsearch/data
node.roles: [ master]
network.host: 0.0.0.0
#transport.host: 0.0.0.0
bootstrap.memory_lock: true
cluster.name: MyCluster
node.name: "VPSYSMNGELK02.test.com"
discovery.seed_hosts: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com", "VPSYSMNGELK03.test.com", "VPSYSMNGELK04.test.com"]
cluster.initial_master_nodes: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com"]
xpack.security.enabled: false

VPSYSMNGELK03(10.30.40.32)
path.data: /usr/share/Elasticsearch/data
node.roles: [ data ]
network.host: 0.0.0.0
#transport.host: 0.0.0.0
bootstrap.memory_lock: true
cluster.name: MyCluster
node.name: "VPSYSMNGELK03.test.com"
discovery.seed_hosts: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com", "VPSYSMNGELK03.test.com", "VPSYSMNGELK04.test.com"]
cluster.initial_master_nodes: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com"]
xpack.security.enabled: false

VPSYSMNGELK04(10.30.40.33)
path.data: /usr/share/Elasticsearch/data
node.roles: [ data ]
network.host: 0.0.0.0
#transport.host: 0.0.0.0
bootstrap.memory_lock: true
cluster.name: MyCluster
node.name: "VPSYSMNGELK04.test.com"
discovery.seed_hosts: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com", "VPSYSMNGELK03.test.com", "VPSYSMNGELK04.test.com"]
cluster.initial_master_nodes: ["VPSYSMNGELK01.test.com", "VPSYSMNGELK02.test.com"]
xpack.security.enabled: false

When every node is started, the cluster forms and everything works.

When I stop the VPSYSMNGELK01 master node, the cluster fails:

[2022-03-20T15:03:07,093][INFO ][o.e.c.c.Coordinator      ] [VPSYSMNGELK02.test.com] master node [{VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}] disconnected, restarting discovery
[2022-03-20T15:03:07,097][INFO ][o.e.c.s.ClusterApplierService] [VPSYSMNGELK02.test.com] master node changed {previous [{VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}], current []}, term: 12, version: 343, reason: becoming candidate: onLeaderFailure
[2022-03-20T15:03:07,104][WARN ][o.e.c.NodeConnectionsService] [VPSYSMNGELK02.test.com] failed to connect to {VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [VPSYSMNGELK01.test.com][10.30.40.30:9300] connect_exception
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1107) ~[elasticsearch-8.1.0.jar:8.1.0]
        at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$0(ActionListener.java:279) ~[elasticsearch-8.1.0.jar:8.1.0]
        at org.elasticsearch.core.CompletableContext.lambda$addListener$0(CompletableContext.java:31) ~[elasticsearch-core-8.1.0.jar:8.1.0]
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2162) ~[?:?]
        at org.elasticsearch.core.CompletableContext.completeExceptionally(CompletableContext.java:46) ~[elasticsearch-core-8.1.0.jar:8.1.0]
        at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:63) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[?:?]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.30.40.30/10.30.40.30:9300
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334) ~[?:?]
        ... 7 more
[2022-03-20T15:03:17,103][WARN ][o.e.c.c.ClusterFormationFailureHelper] [VPSYSMNGELK02.test.com] master not discovered or elected yet, an election requires a node with id [RA4OtAorQpiwcs5WCI6r1Q], have only discovered non-quorum [{VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}]; discovery will continue using [10.30.40.30:9300, 10.30.40.32:9300, 10.30.40.33:9300] from hosts providers and [{VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}, {VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}] from last-known cluster state; node term 12, last-accepted version 343 in term 12
[2022-03-20T15:03:27,106][WARN ][o.e.c.c.ClusterFormationFailureHelper] [VPSYSMNGELK02.test.com] master not discovered or elected yet, an election requires a node with id [RA4OtAorQpiwcs5WCI6r1Q], have only discovered non-quorum [{VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}]; discovery will continue using [10.30.40.30:9300, 10.30.40.32:9300, 10.30.40.33:9300] from hosts providers and [{VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}, {VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}] from last-known cluster state; node term 12, last-accepted version 343 in term 12
[2022-03-20T15:03:37,113][WARN ][o.e.c.c.ClusterFormationFailureHelper] [VPSYSMNGELK02.test.com] master not discovered or elected yet, an election requires a node with id [RA4OtAorQpiwcs5WCI6r1Q], have only discovered non-quorum [{VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}]; discovery will continue using [10.30.40.30:9300, 10.30.40.32:9300, 10.30.40.33:9300] from hosts providers and [{VPSYSMNGELK01.test.com}{RA4OtAorQpiwcs5WCI6r1Q}{GI_aBd_YSi-Ere_kvLZ_yw}{10.30.40.30}{10.30.40.30:9300}{m}, {VPSYSMNGELK02.test.com}{DJf3OpY6RaepLMJHKpSwSA}{JEOLyTh5QNWaSmaV8Phn7w}{10.30.40.31}{10.30.40.31:9300}{m}] from last-known cluster state; node term 12, last-accepted version 343 in term 12

Could you help me solve the problem?

Elasticsearch always requires a strict majority of the master-eligible nodes to be available in order to elect a master, so with only 2 master-eligible nodes both have to be available (a strict majority of 2 is 2) for the cluster to function properly. This is why 3 master-eligible nodes are recommended: the majority of 3 is 2, so one node can go down without affecting the cluster. Have a look at this section in the docs for further details. For small clusters it is common to have 3 nodes that all hold data and are all master-eligible.
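For example, with the same four servers you could make one of the data nodes master-eligible as well. A minimal sketch of that change on VPSYSMNGELK03 (10.30.40.32), assuming the rest of its settings stay exactly as posted above, would be to edit just this line in its elasticsearch.yml and restart the node:

node.roles: [ master, data ]

With VPSYSMNGELK01, VPSYSMNGELK02 and VPSYSMNGELK03 all master-eligible, any two of them form a majority, so the cluster can still elect a master while any single one of them is down.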

You are Superman. Thanks a lot for your support. I added one more master-eligible node, and my cluster works now.
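For reference, a quick way to confirm the new layout is to check which node currently holds the elected master and then stop one master-eligible node to see that another takes over. Assuming HTTP on the default port 9200 and security disabled as in the configs above, something like:

curl "http://10.30.40.31:9200/_cat/nodes?v&h=name,node.role,master"
curl "http://10.30.40.31:9200/_cluster/health?pretty"

The master column of _cat/nodes marks the elected master with an asterisk, and _cluster/health shows the overall cluster status.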
