Upgrade from 6.8 to 7.1 - master not discovered yet: have discovered [...lists all master nodes...]

vangogh94 · June 26, 2019, 2:59pm

Hi, we upgraded to Elasticseach 7.1.1 from 6.8 last week, the cluster has 3 master nodes and 60 data nodes. First few days everything worked as expected. On the third day random nodes started disconnecting and connecting after only few minutes. At first many nodes did this, most of the time one-by-one, sometimes multiple at once. Since last few days only few nodes are dropping out, others working fine.

Master nodes are called es-dbm-00x where x is 1,2,3
Data nodes are called es-dbs-0xx where x is 1 to 60

These are the settings for data nodes (extracted from es-dbs-013):

/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.253
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.13,35.194.221.253
-E node.data=true
-E node.ingest=true
-E node.master=false
-E node.name=es-dbs-013
-E search.remote.connect=false
-E transport.host=21.166.10.13
-E transport.port=9300

These are the settings for master nodes (extracted from es-dbm-001):

/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.initial_master_nodes=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.238
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.201,35.194.221.238
-E node.data=false
-E node.ingest=false
-E node.master=true
-E node.name=es-dbm-001
-E search.remote.connect=false
-E transport.host=21.166.10.201
-E transport.port=9300

the error that happens on data node when it drops out of the cluster:

{"type": "server", "timestamp": "2019-06-26T13:14:38,307+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "..., "node.id": "...",  
"message": "master not discovered yet: have discovered [
     {es-dbm-003}{21.166.10.203}{21.166.10.203:9300}{ml.machine_memory=135084789760, ml.max_open_jobs=20, xpack.installed=true}, 
     {es-dbm-001}{21.166.10.201}{21.166.10.201:9300}{ml.machine_memory=135080017920, ml.max_open_jobs=20, xpack.installed=true}, 
     {es-dbm-002}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}]; 
discovery will continue using [21.166.10.201:9300, 21.166.10.202:9300, 21.166.10.203:9300] 
from hosts providers and [{es-dbs-045}{21.166.10.45}{21.166.10.45:9300}, {es-dbs-022}...

It says it didn't discover master, then says it discovered all masters that we have.
This error is sometimes followed by

"Caused by: java.lang.IllegalStateException: failure when sending a validation request to node",
"at org.elasticsearch.cluster.coordination.Coordinator$3.onFailure(Coordinator.java:500) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.JoinHelper$5.handleException(JoinHelper.java:359) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1124) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$8.run(TransportService.java:966) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.NodeDisconnectedException: [es-dbs-013][21.166.10.13:9300][internal:cluster/coordination/join/validate] disconnected"] }
{"type": "server", "timestamp": "2019-06-26T13:14:55,985+0000", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "lttfGsoiTI6tgR8TU6iSSA", 
  "message": "failed to join {es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=13, lastAcceptedTerm=11, lastAcceptedVersion=142862, sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, targetNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}]}"

Posting the rest in comment below because the limit is 7000 characters:

vangogh94 · June 26, 2019, 3:01pm

On master node these are the messages after node goes and comes back:

{"type": "server", "timestamp": "2019-06-26T13:43:33,460+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-research-cloud", "node.name": "es-dbm-002", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "ATaVaYN6QZePAa5s5IMhsQ",  
"message": "removed {{es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, ml.max_open_jobs=20, xpack.installed=true},{es-dbs-028}{DEtJJF32SwqAXfPc_ATcaw}{XW45RIjBTBeXodTpqyGsZg}{21.166.10.28}{21.166.10.28:9300}{ml.machine_memory=135084613632, ml.max_open_jobs=20, xpack.installed=true},}, term: 13, version: 143206, reason: ApplyCommitRequest{term=13, version=143206, 
sourceNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}"  }

{"type": "server", "timestamp": "2019-06-26T13:43:36,381+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-research-cloud", "node.name": "es-dbm-002", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "ATaVaYN6QZePAa5s5IMhsQ",  
"message": "added {{es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, ml.max_open_jobs=20, xpack.installed=true},}, term: 13, version: 143208, reason: ApplyCommitRequest{term=13, version=143208, 
sourceNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}"  }

So far we tried:

changing transport.tcp.port to transport.port
adding cluster.initial_master_nodes to master settings
adding discovery.seed_hosts to master and data nodes settings
restarting cluster one by one node at a time
turning off all master nodes at the same time, then turning them on
shutting down whole cluster, then starting it again

After all these steps, it appears that only two nodes now have this issue, es-dbs-013 and es-dbs-028.

One important note - the problems appeared only few days after the update.
After these two commands were executed on that day, the problems started immediatelly:

PUT _cluster/settings
{"transient":{"cluster.routing.allocation.exclude._name":"es-dbs-059"}}

and

PUT _cluster/settings
{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance":25}}

The es-dbs-059 node was temporarily excluded because of some issues with hardware, and the cluster_concurrent_rebalance was set to 25 because it was the same value before we upgraded the ES version. I don't know how is this related to the issue, but it does seem like it triggered something.

Thank you.

vangogh94 · June 27, 2019, 6:27pm

The issue was caused by loading of global ordinals on fairly large shards (~50GB). During this time the nodes were not able to boot properly and it resulted in nodes removal from cluster. This is most likely why the cluster was operating without problems until the es-dbs-059 node was excluded and shards from that node went to other nodes in cluster causing them to fail.

Setting this option to false in indices mappings fixed the issue, the cluster is green again after almost a week.

Thanks to ElasticSearch team for great collaboration, you can monitor status of the issue here.

system · July 25, 2019, 6:27pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elastic master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes Elasticsearch	11	12192	February 5, 2020
Strange issue after upgrading from 1.1.0 to 1.4.1 ES Version Elasticsearch	8	388	July 6, 2017
Getting "master not discovered or elected yet" causing cluster not up in version 7.1.0 Elasticsearch	28	15260	July 1, 2019
Rolling upgrade problem from 6.8 to 7.1.1 Elasticsearch	18	2950	July 18, 2019
Master_left and no other node elected to become master Elasticsearch	5	1100	July 6, 2017

Upgrade from 6.8 to 7.1 - master not discovered yet: have discovered [...lists all master nodes...]

Related topics