Hi, we upgraded from Elasticsearch 6.8 to 7.1.1 last week. The cluster has 3 master nodes and 60 data nodes. For the first few days everything worked as expected. On the third day, random nodes started disconnecting from the cluster and reconnecting after only a few minutes. At first many nodes did this, mostly one at a time but sometimes several at once. For the last few days only a few nodes have been dropping out, while the others are working fine.
Master nodes are named es-dbm-00x, where x is 1, 2 or 3.
Data nodes are named es-dbs-0xx, where xx is 01 to 60.
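With this naming scheme it is easy to check which nodes are missing from the cluster at a given moment. Below is a minimal sketch, assuming the node list is the plain-text output of `GET _cat/nodes?h=name` (one node name per line) and that all names are zero-padded to three digits like es-dbs-013. The helper names `expected_nodes` and `missing_nodes` are hypothetical, made up for illustration.

```python
def expected_nodes():
    """Build the expected node names: es-dbm-001..003 and es-dbs-001..060.

    Assumes every name is zero-padded to three digits (as in es-dbs-013).
    """
    masters = [f"es-dbm-{i:03d}" for i in range(1, 4)]
    data = [f"es-dbs-{i:03d}" for i in range(1, 61)]
    return masters + data


def missing_nodes(cat_nodes_output):
    """Return the expected names that do not appear in the given text.

    `cat_nodes_output` is assumed to be the body returned by
    `GET _cat/nodes?h=name`, i.e. one node name per line.
    """
    present = {line.strip() for line in cat_nodes_output.splitlines() if line.strip()}
    return [name for name in expected_nodes() if name not in present]
```

Running `missing_nodes` on the output fetched every few seconds from any node's HTTP port would give a rough timeline of which nodes drop out and when.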
These are the settings for data nodes (extracted from es-dbs-013):
/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.253
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.13,35.194.221.253
-E node.data=true
-E node.ingest=true
-E node.master=false
-E node.name=es-dbs-013
-E search.remote.connect=false
-E transport.host=21.166.10.13
-E transport.port=9300
These are the settings for master nodes (extracted from es-dbm-001):
/usr/local/bin/docker-entrypoint.sh elasticsearch
-E bootstrap.memory_lock=true
-E cluster.initial_master_nodes=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E cluster.name=es-research-cloud
-E cluster.routing.use_adaptive_replica_selection=true
-E discovery.seed_hosts=21.166.10.201:9300,21.166.10.202:9300,21.166.10.203:9300
-E http.compression=true
-E http.port=9200
-E http.host=35.194.221.238
-E indices.breaker.fielddata.limit=90%
-E indices.breaker.request.limit=90%
-E indices.breaker.total.limit=90%
-E logger.level=INFO
-E network.host=21.166.10.201,35.194.221.238
-E node.data=false
-E node.ingest=false
-E node.master=true
-E node.name=es-dbm-001
-E search.remote.connect=false
-E transport.host=21.166.10.201
-E transport.port=9300
This is the error logged on a data node when it drops out of the cluster:
{"type": "server", "timestamp": "2019-06-26T13:14:38,307+0000", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "..., "node.id": "...",
"message": "master not discovered yet: have discovered [
{es-dbm-003}{21.166.10.203}{21.166.10.203:9300}{ml.machine_memory=135084789760, ml.max_open_jobs=20, xpack.installed=true},
{es-dbm-001}{21.166.10.201}{21.166.10.201:9300}{ml.machine_memory=135080017920, ml.max_open_jobs=20, xpack.installed=true},
{es-dbm-002}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}];
discovery will continue using [21.166.10.201:9300, 21.166.10.202:9300, 21.166.10.203:9300]
from hosts providers and [{es-dbs-045}{21.166.10.45}{21.166.10.45:9300}, {es-dbs-022}...
It says the master has not been discovered yet, but then lists all three of our master nodes as discovered.
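To compare the discovered list against the configured masters across many such log lines, the node names can be pulled out of the message text. This is a rough sketch, assuming the "message" field has the same shape as in the log above, with each node starting as {name}{ip}{ip:port}. The names `NODE_RE` and `discovered_nodes` are hypothetical and not part of Elasticsearch.

```python
import re

# Matches the leading {node-name}{ip}{ip:port} triple of each node descriptor.
# The trailing {attr=value, ...} block is not matched and is ignored.
NODE_RE = re.compile(r"\{([^{}]+)\}\{([\d.]+)\}\{([\d.]+:\d+)\}")


def discovered_nodes(message):
    """Return (name, transport_address) pairs listed after 'have discovered ['.

    Only the first bracketed list is scanned, so nodes mentioned later in the
    message (for example after 'from hosts providers and') are not included.
    """
    start = message.find("have discovered [")
    if start == -1:
        return []
    end = message.find("];", start)
    section = message[start:] if end == -1 else message[start:end]
    return [(name, addr) for name, _ip, addr in NODE_RE.findall(section)]
```

For the message above this yields the three es-dbm nodes with their 9300 transport addresses, which matches what is set in discovery.seed_hosts.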
This error is sometimes followed by:
"Caused by: java.lang.IllegalStateException: failure when sending a validation request to node",
"at org.elasticsearch.cluster.coordination.Coordinator$3.onFailure(Coordinator.java:500) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.cluster.coordination.JoinHelper$5.handleException(JoinHelper.java:359) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1124) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportService$8.run(TransportService.java:966) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.NodeDisconnectedException: [es-dbs-013][21.166.10.13:9300][internal:cluster/coordination/join/validate] disconnected"] }
{"type": "server", "timestamp": "2019-06-26T13:14:55,985+0000", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "es-research-cloud", "node.name": "es-dbs-013", "cluster.uuid": "Tnbn6gyVRUWU4p-m--4gIA", "node.id": "lttfGsoiTI6tgR8TU6iSSA",
"message": "failed to join {es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=13, lastAcceptedTerm=11, lastAcceptedVersion=142862, sourceNode={es-dbs-013}{lttfGsoiTI6tgR8TU6iSSA}{4QZioY9rReS9foJojxPqHA}{21.166.10.13}{21.166.10.13:9300}{ml.machine_memory=135084597248, xpack.installed=true, ml.max_open_jobs=20}, targetNode={es-dbm-002}{zGdGM6McSXWTnOo8_R96rQ}{qJwzpTX6TIa69euyarijQw}{21.166.10.202}{21.166.10.202:9300}{ml.machine_memory=135084916736, ml.max_open_jobs=20, xpack.installed=true}}]}"
Posting the rest in a comment below because of the 7000-character limit: