Could not get cluster status after master node goes down

Hi,

Recently we upgraded ES to 5.6.7.

I have a two-node cluster. After stopping the master node, the command "curl -X GET localhost:9211/_cluster/health" doesn't respond.

I've run jstack and seen that many threads are blocked. For example:

Thread 22447: (state = BLOCKED)

  • sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
  • java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=175 (Compiled frame)
  • java.util.concurrent.LinkedTransferQueue.awaitMatch(java.util.concurrent.LinkedTransferQueue$Node, java.util.concurrent.LinkedTransferQueue$Node, java.lang.Object, boolean, long) @bci=184, line=737 (Compiled frame)
  • java.util.concurrent.LinkedTransferQueue.xfer(java.lang.Object, boolean, int, long) @bci=286, line=647 (Compiled frame)
  • java.util.concurrent.LinkedTransferQueue.take() @bci=5, line=1269 (Compiled frame)
  • org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take() @bci=4, line=161 (Compiled frame)
  • java.util.concurrent.ThreadPoolExecutor.getTask() @bci=149, line=1067 (Compiled frame)
  • java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1127 (Interpreted frame)
  • java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
  • java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

or

Thread 21494: (state = BLOCKED)

  • org.apache.logging.log4j.core.layout.TextEncoderHelper.writeChunkedEncodedText(java.nio.charset.CharsetEncoder, java.nio.CharBuffer, org.apache.logging.log4j.core.layout.ByteBufferDestination, java.nio.ByteBuffer, java.nio.charset.CoderResult) @bci=5, line=112 (Interpreted frame)
  • org.apache.logging.log4j.core.layout.TextEncoderHelper.writeEncodedText(java.nio.charset.CharsetEncoder, java.nio.CharBuffer, java.nio.ByteBuffer, org.apache.logging.log4j.core.layout.ByteBufferDestination, java.nio.charset.CoderResult) @bci=14, line=79 (Interpreted frame)
  • org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeChunkedText(java.nio.charset.CharsetEncoder, java.nio.CharBuffer, java.nio.ByteBuffer, java.lang.StringBuilder, org.apache.logging.log4j.core.layout.ByteBufferDestination) @bci=91, line=143 (Interpreted frame)
  • org.apache.logging.log4j.core.layout.TextEncoderHelper.encodeText(java.nio.charset.CharsetEncoder, java.nio.CharBuffer, java.nio.ByteBuffer, java.lang.StringBuilder, org.apache.logging.log4j.core.layout.ByteBufferDestination) @bci=22, line=58 (Interpreted frame)
  • org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(java.lang.StringBuilder, org.apache.logging.log4j.core.layout.ByteBufferDestination) @bci=37, line=68 (Interpreted frame)
  • org.apache.logging.log4j.core.layout.StringBuilderEncoder.encode(java.lang.Object, org.apache.logging.log4j.core.layout.ByteBufferDestination) @bci=6, line=32 (Interpreted frame)

After starting the master node again I see the following error messages in the logs:

Cheers,
Vahid

Looks like a split-brain situation to me. You may want to have a look at this

It was working fine with two nodes on version 1.7. There is one replica shard for the indices, so if one node goes down the cluster should keep working properly, and when the downed node comes back it should rejoin as well...

So you are just testing that you can still query the cluster when one node goes down.

Can you show us your elasticsearch.yml?

discovery.zen.ping.unicast.hosts: ["ip-first-node:9311","ip-second-node:9311"]
cluster.name: cluster-name
indices.ttl.interval: 86400s
http.port: 9211
reindex.remote.whitelist: localhost:*
transport.tcp.compress: true
transport.tcp.port: 9311
path.repo: backup-folder
discovery.zen.minimum_master_nodes: 1
bootstrap.system_call_filter: false
path.data: data-folder
network.host: 0.0.0.0
node.name: ip-first-node:10110
action.auto_create_index: false

discovery.zen.minimum_master_nodes should be 2 here.
The quorum is calculated as described in Important Elasticsearch configuration | Elasticsearch Reference [5.6] | Elastic.
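For reference, a rough sketch of the quorum arithmetic for your two-node cluster (this assumes both nodes are master-eligible; the line below is what the formula yields, not something you posted):

# quorum = (master-eligible nodes / 2) + 1 = (2 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
# Note: with this value a lone surviving node cannot elect a master on its own.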

So if I have only 2 nodes and one crashes, will it still work with "discovery.zen.minimum_master_nodes: 2"?

Actually, this problem was reported by our customers, and they have 3 nodes with minimum_master_nodes set to 2. So I don't think this is the root cause.

Then your repro on a two-node cluster with the quorum set to 1 is wrong. Please revise it in line with what your customers see and report back.

So you mean that we cannot have a two-node cluster, and that if one node crashes the other node can no longer serve requests?

These are the additional settings which are applied on the three-node cluster:

cluster.routing.allocation.awareness.attributes: sitename
node.master: true
node.data: true
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3
indices.memory.index_buffer_size: 20%
indices.fielddata.cache.size: 10%
gateway.expected_nodes: 1
cluster.routing.allocation.allow_rebalance: indices_primaries_active
node.attr.sitename: xxx
discovery.zen.ping_timeout: 10s
gateway.recover_after_nodes: 1
gateway.recover_after_time: 5m
discovery.zen.minimum_master_nodes: 2

I've also set minimum_master_nodes to 2 and restarted one node. The nodes cannot find each other... It works only if I restart both nodes...

After a YAML config change (on all nodes), a restart must be performed on all nodes.

Thank you for your feedback. I'm curious to know whether I can have a two-node cluster that keeps working even if one node crashes.

If you are looking for high availability you need a minimum of 3 master-eligible nodes so the remaining two nodes can form a majority and elect a new master.
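As a rough illustration of what that looks like, here is a hypothetical per-node sketch (the hostnames node1/node2/node3 are placeholders, not taken from your setup; the port matches the transport.tcp.port you posted):

cluster.name: cluster-name
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["node1:9311", "node2:9311", "node3:9311"]
# (3 master-eligible nodes / 2) + 1 = 2, so any two surviving nodes can still elect a master
discovery.zen.minimum_master_nodes: 2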

It'll work only if the node that goes down is not the master. AFAIK, re-election will take place only if both nodes are master-eligible.

These are the configurations of the three nodes:
node1:

node2:

node3:

After restarting the master node (node 1), the two remaining nodes don't respond to any request.
In the logs of node 3 there are messages saying that node 3 is still trying to connect to the gone master (node 1) instead of forming a new cluster with node 2, and node 2's logs show the same.

Thank you for your feedback.
Vahid

Both are master-eligible. Anyway, it seems that high availability with two nodes is no longer supported in versions higher than 1.x, and at least three nodes must be in the cluster.

It has never been possible to have a fully highly available cluster with only two nodes.


It was possible; we were using it for our test environments, and after crashing one node the cluster was still serving properly with only one node. However, we sometimes suffered from split brain, even with three nodes...

If you were able to serve writes when one node was down in a two-node cluster, it means that you have not set minimum_master_nodes correctly. This can lead to split-brain scenarios and data loss. You should always set minimum_master_nodes according to these guidelines.
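To make the risk concrete, here is a sketch of what the value from your first config allows on a two-node cluster (illustrative only):

discovery.zen.minimum_master_nodes: 1
# With a quorum of 1, each node can elect itself master if the link between
# them drops, so both sides keep accepting writes independently (split brain)
# and their data diverges.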

As I wrote in the initial comment, we were using ES 1.7, and with this version there was always a possibility of a split brain, even when configuring minimum_master_nodes to a high value (nodes/2 + 1) and with more than two nodes. Have a look at the old version (https://www.elastic.co/guide/en/elasticsearch/reference/1.7/modules-node.html#split-brain).

However, now we have a bigger problem: with three nodes and the configuration mentioned above we have no high availability!