Master-eligible node cluster resiliency for Elasticsearch 7.x

Hello,

I am using Elasticsearch v7.17.0. Here is the configuration I am testing the new version with.

I have three dedicated master nodes and one data node. This sets up my cluster and I am in business. However, when I scale the master nodes down (which, in this test, terminates the active master node), I lose the cluster state.

Error:

curl http://escluster.abc.internal:9200/_cat/nodes?v
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

When I set up the initial cluster I had only one master and one data node, and used this property to bootstrap the initial master configuration. Later on I added two more master nodes, which in turn updated the cluster.initial_master_nodes property to include two more entries.

/etc/elasticsearch/elasticsearch.yml:

cluster.name: escluster
network.host: _local:ipv4_, _site_
path.data: /data/lib
path.logs: /data/log
cluster.initial_master_nodes: [ip-10-192-106-161,ip-10-192-105-166,ip-10-192-108-236]
discovery.seed_providers: ec2
discovery.ec2.groups: sg-xxxxx
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
node.roles: master

If I do things more gracefully, that is, remove the active master from the voting configuration and then terminate it, the cluster maintains its state. However, in a real-world scenario systems can come and go, so I want to understand how I can achieve resiliency across my master nodes.
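For reference, the graceful path I mean is the voting configuration exclusions API; roughly something like this (node name taken from my config above, same endpoint as my curl above):

# exclude the currently elected master from the voting configuration before terminating it
curl -X POST 'http://escluster.abc.internal:9200/_cluster/voting_config_exclusions?node_names=ip-10-192-106-161'

# once the node has been terminated and has left the cluster, clear the exclusion list again
curl -X DELETE 'http://escluster.abc.internal:9200/_cluster/voting_config_exclusions'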

Also note that I thought cluster.initial_master_nodes is only needed during the initial bootstrap and is no longer used after that, so I am confused about why this test is not resilient.

Adding some log entries as well; once the active master node is lost, this is the error message from the other master node.

[2022-02-14T10:34:39,474][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ip-10-192-105-166] master not discovered or elected yet, an election requires a node with id [n8lziZmbTgqKYEvMlc5krg], have only discovered non-quorum [{ip-10-192-105-166}{FIhJ-P81SvaXTFkjq-PRbg}{1tVS0hmtQc6-phTARZjtFg}{10.192.105.166}{10.192.105.166:9300}{m}]; discovery will continue using [127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.192.105.38:9300, 10.192.105.166:9300] from hosts providers and [{ip-10-192-106-161}{n8lziZmbTgqKYEvMlc5krg}{L0GCQ91fTxGASbIASxTenQ}{10.192.106.161}{10.192.106.161:9300}{m}, {ip-10-192-105-166}{FIhJ-P81SvaXTFkjq-PRbg}{1tVS0hmtQc6-phTARZjtFg}{10.192.105.166}{10.192.105.166:9300}{m}] from last-known cluster state; node term 4, last-accepted version 86 in term 4
[the same ClusterFormationFailureHelper warning repeats every 10 seconds with identical content]

Any help is appreciated.

-Cross

It looks like you only have two master-eligible nodes, but you need three for resilience. See these docs for more information.
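If you want to see this for yourself, something like the following (reusing the endpoint from your curl) lists which nodes are master-eligible and which node IDs are in the committed voting configuration; your log above already shows that an election requires the ID of the master you terminated:

# which nodes are master-eligible ('m' in node.role) and which one is the elected master
curl 'http://escluster.abc.internal:9200/_cat/nodes?v&h=name,node.role,master'

# node IDs currently in the committed voting configuration
curl 'http://escluster.abc.internal:9200/_cluster/state/metadata?filter_path=metadata.cluster_coordination.last_committed_config&pretty'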

This is correct; you should remove this setting once the cluster has formed. See these docs for more information.

Thanks @DavidTurner - this seems to be an important point: I need to remove this setting once the cluster is formed. Can you please point me to how to remove it? Using an API call on every host?

No, just delete the line from the elasticsearch.yml config file. Elasticsearch does not have write access to this file so there's no API to do this.
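For example, something along these lines on each master node would do it (assuming the /etc/elasticsearch/elasticsearch.yml path from your post); editing the file by hand works just as well:

# remove the bootstrap-only setting from the config file on each node
sudo sed -i '/^cluster\.initial_master_nodes/d' /etc/elasticsearch/elasticsearch.yml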

Thanks for your prompt reply, will test and report back my findings.

Unfortunately, the testing failed - let me walk through the process.

1 - initially I set up one dedicated master and one dedicated data node using the node.roles property (master and data respectively), and set cluster.initial_master_nodes to the single master node.

2 - after step 1, the cluster formed successfully.

curl http://escluster.local.internal:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.192.106.236           29          32   0    0.04    0.03     0.01 m         *      ip-10-192-106-236
10.192.104.252           50          33   0    0.01    0.07     0.05 dilrt     -      ip-10-192-104-252

3 - then I removed the cluster.initial_master_nodes property from the elasticsearch.yml file of the existing nodes and also ensured that newly launched systems do not have this property configured.

4 - added one more master node to the pool, and the cluster was still healthy with the second master in the mix.

curl http://escluster.local.internal:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.192.106.236           29          32   0    0.04    0.03     0.01 m         *      ip-10-192-106-236
10.192.105.242           19          27   9    0.90    0.60     0.25 m         -      ip-10-192-105-242
10.192.104.252           50          33   0    0.01    0.07     0.05 dilrt     -      ip-10-192-104-252

5 - at this stage, ip-10-192-106-236 is still the active master node.

6 - I terminated node ip-10-192-106-236, assuming that ip-10-192-105-242 would become the new active master node. However, the cluster went bad again, with the following error:

curl http://escluster.local.internal:9200/_cat/nodes?v
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503} 

Any thoughts on this?

As I said in my first comment, you only have two master-eligible nodes. You need three for resilience.
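With three master-eligible nodes, the two that remain after you terminate the elected master still form a majority, so repeating your step 6 should simply elect a new master. Something like this should confirm it afterwards:

# check that a new master has been elected
curl 'http://escluster.local.internal:9200/_cat/master?v'

# overall cluster health
curl 'http://escluster.local.internal:9200/_cluster/health?pretty'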

thank you - the last piece I was missing is that the master-eligible nodes require a quorum, so the cluster cannot go from 2 masters down to 1; I need at least 3 master-eligible nodes to start with.
