Cluster stops working when shutting down the master marked with a star? [7.11.2]

Hi!

I have a three-node cluster where all nodes are master-eligible, and I'm using Kibana to check status and run queries.
Running the command:

GET _cat/nodes

Gives me:

ip          heap.percent ram.percent cpu load_1m load_5m load_15m node.role  master name
10.14.86.46            9          44   3                          cdhilmrstw *      STHLM-KLARA-05
10.14.86.45           20          44  20                          cdhilmrstw -      STHLM-KLARA-04
10.14.86.47           20          39   1                          cdhilmrstw -      STHLM-KLARA-06

Shutting down 04 or 06 removes them from the list above, but when I shut down 05 I can no longer query my cluster through Kibana; requests time out.

The following is found in the log of 04:

[2021-09-07T16:13:22,374][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] startProbe(10.14.86.45:9300) not probing local node
[2021-09-07T16:13:22,375][TRACE][o.e.d.SeedHostsResolver  ] [STHLM-KLARA-04] resolved host [10.14.86.45] to [10.14.86.45:9300]
[2021-09-07T16:13:22,375][TRACE][o.e.d.SeedHostsResolver  ] [STHLM-KLARA-04] resolved host [10.14.86.46] to [10.14.86.46:9300]
[2021-09-07T16:13:22,375][TRACE][o.e.d.SeedHostsResolver  ] [STHLM-KLARA-04] resolved host [10.14.86.47] to [10.14.86.47:9300]
[2021-09-07T16:13:22,375][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] probing resolved transport addresses [10.14.86.46:9300, 10.14.86.47:9300]
[2021-09-07T16:13:22,375][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] Peer{transportAddress=10.14.86.47:9300, discoveryNode={STHLM-KLARA-06}{EpEj69OASPeVQ3TdiZ5qEA}{Fcd1rsIgTY6-phJyChUK5g}{10.14.86.47}{10.14.86.47:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, peersRequestInFlight=true} received PeersResponse{masterNode=Optional.empty, knownPeers=[{STHLM-KLARA-04}{sxRVZzEgRCCcRyGK5sULrQ}{UAdd75CvRpqospEfkFOcKw}{10.14.86.45}{10.14.86.45:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}], term=49}
[2021-09-07T16:13:22,375][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] startProbe(10.14.86.45:9300) not probing local node
[2021-09-07T16:13:23,229][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] startProbe(10.14.86.45:9300) not probing local node
[2021-09-07T16:13:23,370][DEBUG][o.e.d.PeerFinder         ] [STHLM-KLARA-04] Peer{transportAddress=10.14.86.46:9300, discoveryNode=null, peersRequestInFlight=false} connection failed
org.elasticsearch.transport.ConnectTransportException: [][10.14.86.46:9300] connect_timeout[3s]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:973) ~[elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) ~[elasticsearch-7.11.2.jar:7.11.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
[2021-09-07T16:13:23,386][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] Peer{transportAddress=10.14.86.47:9300, discoveryNode={STHLM-KLARA-06}{EpEj69OASPeVQ3TdiZ5qEA}{Fcd1rsIgTY6-phJyChUK5g}{10.14.86.47}{10.14.86.47:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, peersRequestInFlight=false} requesting peers
[2021-09-07T16:13:23,386][TRACE][o.e.d.PeerFinder         ] [STHLM-KLARA-04] probing master nodes from cluster state: nodes: 
   {STHLM-KLARA-04}{sxRVZzEgRCCcRyGK5sULrQ}{UAdd75CvRpqospEfkFOcKw}{10.14.86.45}{10.14.86.45:9300}{cdhilmrstw}{ml.machine_memory=17178800128, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=2147483648}, local
   {STHLM-KLARA-05}{2GJuobw8RAGQE5t3J79f5Q}{N3Y6fybpRRa7B7HIqkmX4w}{10.14.86.46}{10.14.86.46:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, master
   {STHLM-KLARA-06}{EpEj69OASPeVQ3TdiZ5qEA}{Fcd1rsIgTY6-phJyChUK5g}{10.14.86.47}{10.14.86.47:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}

The config files look like this (this one is from 04):

bootstrap.memory_lock: false
cluster.name: KLARANATET-ELASTIC-TEST
discovery.seed_hosts:
  - 10.14.86.45
  - 10.14.86.46
  - 10.14.86.47
http.port: 9200
network.host: 10.14.86.45
node.data: true
node.ingest: true
node.master: true
node.max_local_storage_nodes: 1
node.name: STHLM-KLARA-04
path.data: D:\Elastic\ElasticSearch\Data
path.logs: D:\Elastic\ElasticSearch\Logs
transport.tcp.port: 9300
xpack.license.self_generated.type: basic
xpack.security.enabled: false
logger.org.elasticsearch.discovery: TRACE

Why is a new master not elected when node 05 is shut down? The whole point of a cluster is that one node can go down and the cluster keeps working.

What am I missing?

Thanks!

/Kristoffer

The TRACE logs you shared look fine to me, but then they only last for a couple of seconds so they don't really tell us much. Best switch them off for now, and then share logs for the whole outage.

Is Kibana configured to be able to connect to all nodes in the cluster? Is it able to connect to all nodes?
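(For reference, a sketch of what that could look like in kibana.yml, assuming default ports and no TLS — in 7.x the relevant setting is elasticsearch.hosts, and listing all nodes lets Kibana fail over if one goes down:)

```yaml
# kibana.yml — hypothetical sketch: point Kibana at every node
# in the cluster so it can fail over when one node is offline
elasticsearch.hosts:
  - "http://10.14.86.45:9200"
  - "http://10.14.86.46:9200"
  - "http://10.14.86.47:9200"
```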

Ok, I will switch them off and share the logs.

No, I actually have Kibana set up on two servers, and each one is configured to use only its local Elasticsearch instance. Should Kibana always have all nodes in its config?

Here is the log on 04 when 05 is shut down:

[2021-09-07T17:05:48,081][INFO ][o.e.c.c.Coordinator      ] [STHLM-KLARA-04] master node [{STHLM-KLARA-05}{2GJuobw8RAGQE5t3J79f5Q}{Yq_jQYqNT4OT4J_rEoCviQ}{10.14.86.46}{10.14.86.46:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}] failed, restarting discovery
org.elasticsearch.transport.NodeDisconnectedException: [STHLM-KLARA-05][10.14.86.46:9300][disconnected] disconnected
[2021-09-07T17:05:48,096][INFO ][o.e.c.s.ClusterApplierService] [STHLM-KLARA-04] master node changed {previous [{STHLM-KLARA-05}{2GJuobw8RAGQE5t3J79f5Q}{Yq_jQYqNT4OT4J_rEoCviQ}{10.14.86.46}{10.14.86.46:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}], current []}, term: 50, version: 2272, reason: becoming candidate: onLeaderFailure
[2021-09-07T17:05:57,543][WARN ][r.suppressed             ] [STHLM-KLARA-04] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=7, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179) ~[elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:165) ~[elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:56) ~[?:?]
	at org.elasticsearch.xpack.monitoring.action.TransportMonitoringBulkAction.doExecute(TransportMonitoringBulkAction.java:36) ~[?:?]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:173) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:149) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:77) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:86) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:66) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:402) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.xpack.monitoring.rest.action.RestMonitoringBulkAction.lambda$doPrepareRequest$0(RestMonitoringBulkAction.java:108) [x-pack-monitoring-7.11.2.jar:7.11.2]
	at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:104) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:247) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:329) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:180) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:325) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:390) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:307) [elasticsearch-7.11.2.jar:7.11.2]
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:31) [transport-netty4-client-7.11.2.jar:7.11.2]
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:17) [transport-netty4-client-7.11.2.jar:7.11.2]
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:47) [transport-netty4-client-7.11.2.jar:7.11.2]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) [netty-handler-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) [netty-transport-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [netty-common-4.1.49.Final.jar:4.1.49.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.49.Final.jar:4.1.49.Final]
	at java.lang.Thread.run(Thread.java:832) [?:?]
[2021-09-07T17:05:58,106][WARN ][o.e.c.c.ClusterFormationFailureHelper] [STHLM-KLARA-04] master not discovered or elected yet, an election requires a node with id [2GJuobw8RAGQE5t3J79f5Q], have discovered [{STHLM-KLARA-04}{sxRVZzEgRCCcRyGK5sULrQ}{VeNU9zB0RoyN5PIyvaHAvw}{10.14.86.45}{10.14.86.45:9300}{cdhilmrstw}{ml.machine_memory=17178800128, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=2147483648}, {STHLM-KLARA-06}{EpEj69OASPeVQ3TdiZ5qEA}{Fcd1rsIgTY6-phJyChUK5g}{10.14.86.47}{10.14.86.47:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}] which is not a quorum; discovery will continue using [10.14.86.46:9300, 10.14.86.47:9300] from hosts providers and [{STHLM-KLARA-05}{2GJuobw8RAGQE5t3J79f5Q}{Yq_jQYqNT4OT4J_rEoCviQ}{10.14.86.46}{10.14.86.46:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, {STHLM-KLARA-06}{EpEj69OASPeVQ3TdiZ5qEA}{Fcd1rsIgTY6-phJyChUK5g}{10.14.86.47}{10.14.86.47:9300}{cdhilmrstw}{ml.machine_memory=17178800128, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, {STHLM-KLARA-04}{sxRVZzEgRCCcRyGK5sULrQ}{VeNU9zB0RoyN5PIyvaHAvw}{10.14.86.45}{10.14.86.45:9300}{cdhilmrstw}{ml.machine_memory=17178800128, xpack.installed=true, transform.node=true, ml.max_open_jobs=20, ml.max_jvm_size=2147483648}] from last-known cluster state; node term 50, last-accepted version 2272 in term 50

Thank you so much!

/Kristoffer

Could you bring the missing node back up and run DELETE /_cluster/voting_config_exclusions, then try again?

Sorry, correction: DELETE /_cluster/voting_config_exclusions won't work here; you need to run DELETE /_cluster/voting_config_exclusions?wait_for_removal=false instead.
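(If you want to see what exclusions are currently in place before deleting them, they are stored in the cluster state, so something like this in the Kibana dev console should show them — a sketch, using filter_path to trim the response:)

```
GET _cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions
```

An empty response means no exclusions are configured, which is the expected state in normal operation.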

Yes! What just happened here, and how did I end up in this state in the first place?
While I was playing around with this I tried the following:

POST /_cluster/voting_config_exclusions?node_names=STHLM-KLARA-XX

Could this have affected anything?

Thank you so much David!

/Kristoffer

Yes, definitely. See Add and remove nodes in your cluster | Elasticsearch Guide [7.11] | Elastic, and particularly this sentence:

Clusters should have no voting configuration exclusions in normal operation.

You should always delete the voting config exclusions when you're finished with them.
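(The intended workflow looks roughly like this — a sketch, using 06 purely as an example node name: exclusions are only for when you are about to permanently remove a master-eligible node, and they must be cleaned up afterwards:)

```
# Before permanently removing a master-eligible node, exclude it from voting:
POST /_cluster/voting_config_exclusions?node_names=STHLM-KLARA-06

# ...shut down and decommission that node, then remove the exclusion:
DELETE /_cluster/voting_config_exclusions
```

Leaving an exclusion in place is what prevented your remaining two nodes from electing a new master when 05 went down.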
