Upgrade from 6.8 to 7.1 - master not discovered yet: have discovered [...lists all master nodes...]

Hi, we upgraded from Elasticsearch 6.8 to 7.1.1 last week. The cluster has 3 master nodes and 60 data nodes. For the first few days everything worked as expected. On the third day, random nodes started disconnecting and reconnecting after only a few minutes. At first many nodes did this, most of the time one by one, sometimes several at once. For the last few days only a few nodes have been dropping out; the others are working fine.

Master nodes are named es-dbm-00x, where x is 1, 2 or 3.
Data nodes are named es-dbs-0xx, where xx is 01 to 60.

These are the settings for data nodes (extracted from es-dbs-013):

/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.253
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.13,35.194.221.253
-E node.data=true
-E node.ingest=true
-E node.master=false
-E node.name=es-dbs-013
-E search.remote.connect=false
-E transport.host=21.166.10.13
-E transport.port=9300
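
In case it helps, the effective settings a node actually picked up from these -E flags can be read back over the REST API; this is only a diagnostic sketch, and any node name works:

GET _nodes/es-dbs-013/settings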

These are the settings for master nodes (extracted from es-dbm-001):

/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.initial_master_nodes=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.238
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.201,35.194.221.238
-E node.data=false
-E node.ingest=false
-E node.master=true
-E node.name=es-dbm-001
-E search.remote.connect=false
-E transport.host=21.166.10.201
-E transport.port=9300
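
To check which master is currently elected and which nodes the cluster currently sees, the standard cat APIs can be queried on any master's HTTP port (again only a diagnostic sketch):

GET _cat/master?v
GET _cat/nodes?v&h=name,ip,node.role,master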

This is the error that appears on a data node when it drops out of the cluster:

{"type": "server", "timestamp": "2019-06-26T13:14:38,307+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "..., "node.id": "...",  
"message": "master not discovered yet: have discovered [
     {es-dbm-003}{21.166.10.203}{21.166.10.203:9300}{ml.machine_memory=135084789760, ml.max_open_jobs=20, xpack.installed=true}, 
     {es-dbm-001}{21.166.10.201}{21.166.10.201:9300}{ml.machine_memory=135080017920, ml.max_open_jobs=20, xpack.installed=true}, 
     {es-dbm-002}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}]; 
discovery will continue using [21.166.10.201:9300, 21.166.10.202:9300, 21.166.10.203:9300] 
from hosts providers and [{es-dbs-045}{21.166.10.45}{21.166.10.45:9300}, {es-dbs-022}...

It says it has not discovered a master yet, but then lists all of the master nodes we have as discovered.
This error is sometimes followed by:

"Caused by: java.lang.IllegalStateException: failure when sending a validation request to node",
"at org.elasticsearch.cluster.coordination.Coordinator$3.onFailure(Coordinator.java:500) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.JoinHelper$5.handleException(JoinHelper.java:359) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1124) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$8.run(TransportService.java:966) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.NodeDisconnectedException: [es-dbs-013][21.166.10.13:9300][internal:cluster/coordination/join/validate] disconnected"] }
{"type": "server", "timestamp": "2019-06-26T13:14:55,985+0000", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "lttfGsoiTI6tgR8TU6iSSA", 
  "message": "failed to join {es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=13, lastAcceptedTerm=11, lastAcceptedVersion=142862, sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, targetNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}]}" 

Posting the rest in a comment below because of the 7000-character limit:

On the master node, these are the messages when a node drops out and comes back:

{"type": "server", "timestamp": "2019-06-26T13:43:33,460+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-research-cloud", "node.name": "es-dbm-002", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "ATaVaYN6QZePAa5s5IMhsQ",  
"message": "removed {{es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, ml.max_open_jobs=20, xpack.installed=true},{es-dbs-028}{DEtJJF32SwqAXfPc_ATcaw}{XW45RIjBTBeXodTpqyGsZg}{21.166.10.28}{21.166.10.28:9300}{ml.machine_memory=135084613632, ml.max_open_jobs=20, xpack.installed=true},}, term: 13, version: 143206, reason: ApplyCommitRequest{term=13, version=143206, 
sourceNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}"  }

{"type": "server", "timestamp": "2019-06-26T13:43:36,381+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-research-cloud", "node.name": "es-dbm-002", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "ATaVaYN6QZePAa5s5IMhsQ",  
"message": "added {{es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, ml.max_open_jobs=20, xpack.installed=true},}, term: 13, version: 143208, reason: ApplyCommitRequest{term=13, version=143208, 
sourceNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}"  }

So far we tried:

  • changing transport.tcp.port to transport.port
  • adding cluster.initial_master_nodes to the master node settings
  • adding discovery.seed_hosts to the master and data node settings
  • restarting the cluster one node at a time (see the rolling-restart sketch after this list)
  • turning off all master nodes at the same time, then turning them back on
  • shutting down the whole cluster, then starting it again
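
For reference, restarting one node at a time generally follows the standard rolling-restart pattern of disabling shard allocation around each restart; a minimal sketch, not necessarily exactly what we ran:

PUT _cluster/settings
{"persistent":{"cluster.routing.allocation.enable":"primaries"}}

(restart the node, wait for it to rejoin, then re-enable allocation)

PUT _cluster/settings
{"persistent":{"cluster.routing.allocation.enable":null}}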

After all these steps, it appears that only two nodes still have this issue: es-dbs-013 and es-dbs-028.

One important note: the problems appeared only a few days after the upgrade.
After these two commands were executed on that day, the problems started immediately:

PUT _cluster/settings
{"transient":{"cluster.routing.allocation.exclude._name":"es-dbs-059"}}

and

PUT _cluster/settings
{"transient":{"cluster.routing.allocation.cluster_concurrent_rebalance":25}}

The es-dbs-059 node was temporarily excluded because of some hardware issues, and cluster_concurrent_rebalance was set to 25 because that was its value before we upgraded the ES version. I don't know how this is related to the issue, but it does seem like it triggered something.
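
For completeness, both transient settings can be rolled back by setting them to null; a minimal sketch of such a revert:

PUT _cluster/settings
{"transient":{"cluster.routing.allocation.exclude._name":null,"cluster.routing.allocation.cluster_concurrent_rebalance":null}}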

Thank you.

The issue was caused by the loading of global ordinals on fairly large shards (~50 GB). While the ordinals were being loaded, the nodes were not able to start up properly, which resulted in their removal from the cluster. This is most likely why the cluster operated without problems until the es-dbs-059 node was excluded and its shards were relocated to other nodes in the cluster, causing them to fail.

Setting this option to false in the index mappings fixed the issue; the cluster is green again after almost a week.
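
For anyone running into the same problem: the mapping option that controls eager loading of global ordinals is eager_global_ordinals. A minimal sketch of this kind of mapping update (the index and field names below are placeholders):

PUT my-index/_mapping
{"properties":{"my_keyword_field":{"type":"keyword","eager_global_ordinals":false}}}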

Thanks to the Elasticsearch team for the great collaboration; you can monitor the status of the issue here.
