Hi, we are facing a serious problem: random nodes start disconnecting from the master and cannot rejoin the cluster until I restart them.
First, the messages from the master node. The master detected that the node had disconnected; this situation is treated as an immediate failure, so the master removed the node from the cluster. We can see there are 1325 delayed shards.
[2020-11-29T10:19:15,836][INFO ][o.e.c.s.MasterService ] [hw-sh-t-opslog-02-masternode]node-left[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 1054740, delta: removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:20,623][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054740, reason: Publication{term=680, version=1054740}
[2020-11-29T10:19:20,668][INFO ][o.e.c.r.DelayedAllocationService] [hw-sh-t-opslog-02-masternode]scheduling reroute for delayed shards in [4.8m] (1325 delayed shards)
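The 1325 delayed shards and the [4.8m] delay above come from index.unassigned.node_left.delayed_timeout, the index setting that postpones reallocating the shards of a node that has just left in case it comes back quickly (the default is 1m, so the 4.8m remaining suggests ours is around 5m). A minimal sketch of how to check or change it, assuming we query a master over its HTTP port 9210 (hostname and credentials below are placeholders):

# show the delayed-timeout setting currently applied to the indices
curl -u elastic -s 'http://hw-sh-t-opslog-02:9210/_all/_settings?include_defaults=true&flat_settings=true' | grep node_left

# set it explicitly on all indices (the value here is only an example)
curl -u elastic -s -XPUT 'http://hw-sh-t-opslog-02:9210/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"settings":{"index.unassigned.node_left.delayed_timeout":"5m"}}'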
Then we can see some messages from the data node. The data node ran [3] consecutive leader checks, and each one failed because the master had already removed the data node from the cluster state.
[2020-11-29T10:19:18,821][INFO ][o.e.c.c.Coordinator ] [hw-sh-t-opslog-10-datanode_stale]master node [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}] failed [3] consecutive checks
Caused by: org.elasticsearch.transport.RemoteTransportException: [hw-sh-t-opslog-02-masternode][10.221.46.66:9310][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}] has been removed from the cluster
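For context, the [3] in the exception matches cluster.fault_detection.leader_check.retry_count (default 3): a follower abandons its leader only after that many consecutive failed or rejected checks. Since the data node's yml below still uses the old discovery.zen.fd.* keys, one way to confirm which fault-detection values each node actually picked up is the nodes info API; a sketch (credentials and host are placeholders):

# print the fault-detection related settings the data node is running with
curl -u elastic -s 'http://hw-sh-t-opslog-02:9210/_nodes/hw-sh-t-opslog-10-datanode_stale/settings?flat_settings=true&pretty' \
  | grep -E 'fault_detection|discovery\.zen'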
After this happened, the data node changed its cluster state.
We can see the data node changed the master node from the previous [{hw-sh-t-opslog-02-masternode}] to an empty list. Then an endless loop began: the data node keeps saying the master is not discovered yet, even though it has discovered [...hw-sh-t-opslog-02-masternode...]. Actually, hw-sh-t-opslog-02-masternode is the real master! I am confused why the data node does not consider it the master node. The data node also keeps the same cluster state version forever. Is there some conflict with the master node's cluster state?
[2020-11-29T10:19:18,825][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-10-datanode_stale]master node changed {previous [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 680, version: 1054738, reason: becoming candidate: onLeaderFailure
[2020-11-29T10:19:28,827][WARN ][o.e.c.c.ClusterFormationFailureHelper] [hw-sh-t-opslog-10-datanode_stale]master not discovered yet: have discovered [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, xpack.installed=true, box_type=stale, ml.max_open_jobs=20}, {hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.221.39.248:9310, 10.221.46.66:9310, 10.221.40.80:9310] from hosts providers and [{hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 680, last-accepted version 1054738 in term 680
[2020-11-29T10:19:34,202][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [hw-sh-t-opslog-10-datanode_stale]no known master node, scheduling a retry
[2020-11-29T10:19:38,828][WARN ][o.e.c.c.ClusterFormationFailureHelper] [hw-sh-t-opslog-10-datanode_stale]master not discovered yet: have discovered [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, xpack.installed=true, box_type=stale, ml.max_open_jobs=20}, {hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.221.39.248:9310, 10.221.46.66:9310, 10.221.40.80:9310] from hosts providers and [{hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 680, last-accepted version 1054738 in term 680
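To check whether the data node's last-accepted state (version 1054738 in term 680 above) really conflicts with the master's, one thing we could do is compare the version and master that each node holds locally. A sketch, using the HTTP ports from our configs (9210 on masters, 9201 on data nodes) and placeholder hostnames/credentials:

# the state as held by the elected master
curl -u elastic -s 'http://hw-sh-t-opslog-02:9210/_cluster/state/version,master_node?local=true&pretty'

# the state the stuck data node still holds (local=true avoids asking the master)
curl -u elastic -s 'http://hw-sh-t-opslog-10:9201/_cluster/state/version,master_node?local=true&pretty'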
At the same time, according to the master's messages, we can see the data node joining the cluster and then leaving again, over and over. Only after I restarted the data node did the cluster recover. I guess the restart resets the cluster state saved by the data node?
[2020-11-29T10:19:24,892][INFO ][o.e.c.s.MasterService ] [hw-sh-t-opslog-02-masternode]node-join[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} join existing leader], term: 680, version: 1054742, delta: added {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:27,712][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]added {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054742, reason: Publication{term=680, version=1054742}
[2020-11-29T10:19:30,595][INFO ][o.e.c.s.MasterService ] [hw-sh-t-opslog-02-masternode]node-left[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 1054743, delta: removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:31,238][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054743, reason: Publication{term=680, version=1054743}
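Since every removal above is logged with reason: disconnected, the transport connection from the master to the data node (10.221.42.246:9301 in the logs) appears to be dropping. A rough way to watch for that from the master host, assuming nc and ss are available there:

# one-off reachability check of the data node's transport port
nc -vz 10.221.42.246 9301

# see whether the master keeps an established connection to the data node
ss -tn | grep '10.221.42.246:9301'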
How can I avoid or solve this problem? Thanks a lot!
My cluster's config:
Elasticsearch version: 7.5.1
elasticsearch.yml (master node):
bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000
cluster.name: billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 6
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.same_shard.host: true
discovery.seed_hosts:
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310
cluster.fault_detection.follower_check.interval: 10s
cluster.fault_detection.follower_check.timeout: 60s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 10s
cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.leader_check.retry_count: 3
indices.breaker.total.use_real_memory: false
http.port: 9210
indices.recovery.max_bytes_per_sec: 500mb
network.host: 0.0.0.0
node.data: false
node.master: true
transport.tcp.port: 9310
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/masternode/billions-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/masternode/billions-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate
node.name: hw-sh-t-opslog-02-masternode
path.data: /mnt/storage01/elasticsearch/data/hw-sh-t-opslog-02-masternode
path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-02-masternode
action.auto_create_index: true
xpack.security.enabled: true
elasticsearch.yml (data node):
bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000
cluster.name: billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.same_shard.host: true
discovery.seed_hosts:
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 60s
discovery.zen.ping_timeout: 10s
http.port: 9201
indices.recovery.max_bytes_per_sec: 500mb
network.host: 0.0.0.0
node.attr.box_type: stale
node.data: true
node.master: false
transport.tcp.port: 9301
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate
node.name: hw-sh-t-opslog-10-datanode_stale
path.data: /mnt/storage01/hw-sh-t-opslog-10-datanode_stale
path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-10-datanode_stale
action.auto_create_index: true
xpack.security.enabled: true