ES data node can't rejoin the cluster after being disconnected from the master

Hi, we are facing a serious problem: random nodes start disconnecting from the master and cannot rejoin the cluster until I restart them.

First, the messages from the master node. The master detects that the node has disconnected; this is treated as an immediate failure, so the master removes the node from the cluster. We can see there are 1325 delayed shards (a quick way to check this is shown after the log excerpt).

[2020-11-29T10:19:15,836][INFO ][o.e.c.s.MasterService    ] [hw-sh-t-opslog-02-masternode]node-left[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 1054740, delta: removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:20,623][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054740, reason: Publication{term=680, version=1054740}
[2020-11-29T10:19:20,668][INFO ][o.e.c.r.DelayedAllocationService] [hw-sh-t-opslog-02-masternode]scheduling reroute for delayed shards in [4.8m] (1325 delayed shards)
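(For anyone checking the same thing: a rough way to see the delayed-shard count and the effective delay timeout is below. The port 9210 is our master node's http.port, and <user>:<password> and my-index are placeholders; security is enabled on this cluster, so credentials are required.)

# shards whose re-allocation is being delayed after a node left
curl -s -u <user>:<password> 'http://localhost:9210/_cluster/health?filter_path=delayed_unassigned_shards,unassigned_shards&pretty'

# effective index.unassigned.node_left.delayed_timeout for one index
curl -s -u <user>:<password> 'http://localhost:9210/my-index/_settings?include_defaults=true&filter_path=**.unassigned.node_left.delayed_timeout&pretty'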

Then we can see some messages from the data node. The data node performed [3] consecutive leader checks, and they failed because the master had already removed the data node from the cluster state.

[2020-11-29T10:19:18,821][INFO ][o.e.c.c.Coordinator      ] [hw-sh-t-opslog-10-datanode_stale]master node [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}] failed [3] consecutive checks
Caused by: org.elasticsearch.transport.RemoteTransportException: [hw-sh-t-opslog-02-masternode][10.221.46.66:9310][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}] has been removed from the cluster
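(A sketch of something we could try next time this happens, not something we have done yet: temporarily raise coordination logging to DEBUG to get more detail about the failed leader checks. Port and credential placeholders are as above.)

curl -s -u <user>:<password> -X PUT 'http://localhost:9210/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.cluster.coordination": "DEBUG"
  }
}'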

After this happened, the data node changed its view of the cluster state.

We can see the data node changed the master node from the previous [{hw-sh-t-opslog-02-masternode}] to an empty list. Then an infinite loop happened: the data node keeps saying the master is not discovered yet, even though it has discovered [...hw-sh-t-opslog-02-masternode...]. Actually, hw-sh-t-opslog-02-masternode is the real master! I am confused about why the data node does not consider it the master. The data node keeps the same cluster state version forever. Is there some conflict with the master node's cluster state? (A way to compare the cluster-state version each node has applied is sketched after the logs below.)

[2020-11-29T10:19:18,825][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-10-datanode_stale]master node changed {previous [{hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 680, version: 1054738, reason: becoming candidate: onLeaderFailure
[2020-11-29T10:19:28,827][WARN ][o.e.c.c.ClusterFormationFailureHelper] [hw-sh-t-opslog-10-datanode_stale]master not discovered yet: have discovered [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, xpack.installed=true, box_type=stale, ml.max_open_jobs=20}, {hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.221.39.248:9310, 10.221.46.66:9310, 10.221.40.80:9310] from hosts providers and [{hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 680, last-accepted version 1054738 in term 680
[2020-11-29T10:19:34,202][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [hw-sh-t-opslog-10-datanode_stale]no known master node, scheduling a retry
[2020-11-29T10:19:38,828][WARN ][o.e.c.c.ClusterFormationFailureHelper] [hw-sh-t-opslog-10-datanode_stale]master not discovered yet: have discovered [{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, xpack.installed=true, box_type=stale, ml.max_open_jobs=20}, {hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.221.39.248:9310, 10.221.46.66:9310, 10.221.40.80:9310] from hosts providers and [{hw-sh-t-opslog-03-masternode}{V9-7ygtvTfSouxWOHsu3MQ}{fEpYCF3QQvacGL7nLPHXew}{10.221.40.80}{10.221.40.80:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-02-masternode}{EFhLqgmZS4-FmcXZtc-eSg}{M1_QGw4tTbSsEUKsVcV4sA}{10.221.46.66}{10.221.46.66:9310}{ilm}{ml.machine_memory=67560849408, ml.max_open_jobs=20, xpack.installed=true}, {hw-sh-t-opslog-01-masternode}{PGELd7MkQgCSTh_aYS9CpA}{zj9ve9h0SSq5F9NMTii8tA}{10.221.39.248}{10.221.39.248:9310}{ilm}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 680, last-accepted version 1054738 in term 680
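(To compare what each node has actually applied, a check like this should work; ?local=true makes each node answer from its own applied cluster state instead of forwarding the request to the master. 9201 is the data node's http.port and 9210 the master's; credentials as above.)

# cluster-state version and master as seen by the stuck data node
curl -s -u <user>:<password> 'http://10.221.42.246:9201/_cluster/state/version,master_node?local=true&pretty'
# the same, as seen by the elected master
curl -s -u <user>:<password> 'http://10.221.46.66:9210/_cluster/state/version,master_node?local=true&pretty'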

During the same time period, according to the master's messages, we find that the data node joined the cluster and soon left, again and again. Only after I restarted the data node did the cluster recover. I guess a restart resets the cluster state saved by the data node?

[2020-11-29T10:19:24,892][INFO ][o.e.c.s.MasterService    ] [hw-sh-t-opslog-02-masternode]node-join[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} join existing leader], term: 680, version: 1054742, delta: added {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:27,712][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]added {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054742, reason: Publication{term=680, version=1054742}
[2020-11-29T10:19:30,595][INFO ][o.e.c.s.MasterService    ] [hw-sh-t-opslog-02-masternode]node-left[{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 1054743, delta: removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}
[2020-11-29T10:19:31,238][INFO ][o.e.c.s.ClusterApplierService] [hw-sh-t-opslog-02-masternode]removed {{hw-sh-t-opslog-10-datanode_stale}{g9wVAUPLQvaFCszqo1paNw}{oloGCXQeRRibpUPN8vmUVQ}{10.221.42.246}{10.221.42.246:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}, term: 680, version: 1054743, reason: Publication{term=680, version=1054743}

How can I avoid or solve this problem? Thanks a lot!

My cluster's config:

Elasticsearch version: 7.5.1

elasticsearch.yml (master node):

bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000
cluster.name: billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 6
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.same_shard.host: true
discovery.seed_hosts:
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310

cluster.fault_detection.follower_check.interval: 10s
cluster.fault_detection.follower_check.timeout: 60s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 10s
cluster.fault_detection.leader_check.timeout: 60s
cluster.fault_detection.leader_check.retry_count: 3

indices.breaker.total.use_real_memory: false

http.port: 9210
indices.recovery.max_bytes_per_sec: 500mb
network.host: 0.0.0.0
node.data: false
node.master: true
transport.tcp.port: 9310
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/masternode/billions-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/masternode/billions-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate



node.name: hw-sh-t-opslog-02-masternode

path.data: /mnt/storage01/elasticsearch/data/hw-sh-t-opslog-02-masternode

path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-02-masternode


action.auto_create_index: true

xpack.security.enabled: true

elasticsearch.yml (data node):

bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000
cluster.name: billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.same_shard.host: true
discovery.seed_hosts:
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310
discovery.zen.fd.ping_interval: 10s
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_timeout: 60s
discovery.zen.ping_timeout: 10s
http.port: 9201
indices.recovery.max_bytes_per_sec: 500mb
network.host: 0.0.0.0
node.attr.box_type: stale
node.data: true
node.master: false
transport.tcp.port: 9301
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate



node.name: hw-sh-t-opslog-10-datanode_stale

path.data: /mnt/storage01/hw-sh-t-opslog-10-datanode_stale

path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-10-datanode_stale


action.auto_create_index: true

xpack.security.enabled: true

It looks like you have overridden a lot of settings that are either there to protect the cluster or are expert level settings.

This is bad. The default limit of 1000 shards per data node is in my mind quite high and should generally not be increased. Having lots of small indices and shards in a cluster is very inefficient and can cause problems with performance and stability as it often results in a very large cluster state that can be slow to propagate and require frequent updates.
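For a quick picture of how those shards are spread across the data nodes, something like this (adjust host, port and credentials to your setup) lists shard count and disk use per node:

curl -s -u <user>:<password> 'http://localhost:9210/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail'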

Why have you overridden these?

How did you arrive at these settings?

How large is the cluster? How is it deployed? What type of hardware and storage is used?

Can you provide the full output of the cluster stats API?

Have you verified that you have full connectivity in both directions between all nodes in the cluster, e.g. telnet from master to data node as well as the other way around?

The cluster has 13 VM servers and 26 nodes. We use SSDs to store hot data and HDDs to store stale data. The problem usually happens on the stale data nodes (which use HDDs).

We have verified connectivity in both directions between all nodes in the cluster using telnet, and there is no problem.

Here is the output of the cluster stats API.

{
  "_nodes": {
    "total": 26,
    "successful": 26,
    "failed": 0
  },
  "cluster_name": "billions-uat7.5.1",
  "cluster_uuid": "RCBZWPQMTJWb2IA8Q4HccA",
  "timestamp": 1606722781365,
  "status": "green",
  "indices": {
    "count": 6974,
    "shards": {
      "total": 20557,
      "primaries": 14079,
      "replication": 0.4601179061012856,
      "index": {
        "shards": {
          "min": 2,
          "max": 20,
          "avg": 2.94766274734729
        },
        "primaries": {
          "min": 1,
          "max": 10,
          "avg": 2.0187840550616576
        },
        "replication": {
          "min": 0,
          "max": 1,
          "avg": 0.46142816174361917
        }
      }
    },
    "docs": {
      "count": 45095621439,
      "deleted": 228
    },
    "store": {
      "size_in_bytes": 11545217402568
    },
    "fielddata": {
      "memory_size_in_bytes": 1341184,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 228004327,
      "total_count": 17475604,
      "hit_count": 3846204,
      "miss_count": 13629400,
      "cache_size": 20488,
      "cache_count": 86706,
      "evictions": 66218
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 149227,
      "memory_in_bytes": 23365042509,
      "terms_memory_in_bytes": 14242750375,
      "stored_fields_memory_in_bytes": 7947083408,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 97500032,
      "points_memory_in_bytes": 1029221226,
      "doc_values_memory_in_bytes": 48487468,
      "index_writer_memory_in_bytes": 2457968256,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 784,
      "max_unsafe_auto_id_timestamp": 1606698467398,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 26,
      "coordinating_only": 0,
      "data": 23,
      "ingest": 26,
      "master": 3,
      "ml": 26,
      "voting_only": 0
    },
    "versions": [
      "7.5.1"
    ],
    "os": {
      "available_processors": 832,
      "allocated_processors": 832,
      "names": [
        {
          "name": "Linux",
          "count": 26
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Debian GNU/Linux 9 (stretch)",
          "count": 26
        }
      ],
      "mem": {
        "total_in_bytes": 1756582281216,
        "free_in_bytes": 59908493312,
        "used_in_bytes": 1696673787904,
        "free_percent": 3,
        "used_percent": 97
      }
    },
    "process": {
      "cpu": {
        "percent": 254
      },
      "open_file_descriptors": {
        "min": 1521,
        "max": 20292,
        "avg": 10625
      }
    },
    "jvm": {
      "max_uptime_in_millis": 10459340426,
      "versions": [
        {
          "version": "11.0.5",
          "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
          "vm_version": "11.0.5+10-LTS",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": false,
          "count": 26
        }
      ],
      "mem": {
        "heap_used_in_bytes": 208099282104,
        "heap_max_in_bytes": 494994980864
      },
      "threads": 8438
    },
    "fs": {
      "total_in_bytes": 125715375738880,
      "free_in_bytes": 118067256119296,
      "available_in_bytes": 118067256119296
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 26
      },
      "http_types": {
        "security4": 26
      }
    },
    "discovery_types": {
      "zen": 26
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "deb",
        "count": 26
      }
    ]
  }
}

You have a lot of indices and shards given the amount of data in the cluster. If I calculate correctly, your average shard size is considerably less than 1 GB, which is very small. The recommended shard size for time-based indices is often 30 GB to 50 GB. I would recommend reducing the shard count significantly and changing how you shard your data. Once the shard count per node is down, I would recommend going back to the default limit of shards per node.
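As a rough sketch of one way to start (index and node names below are placeholders or examples from this thread, not a prescription): list the smallest indices, then shrink or reindex them into fewer primaries, for example with the shrink API:

# list indices sorted by primary store size
curl -s -u <user>:<password> 'http://localhost:9210/_cat/indices?v&h=index,pri,rep,pri.store.size&s=pri.store.size'

# prepare one small index for shrinking: block writes and move all its copies onto a single node
curl -s -u <user>:<password> -X PUT 'http://localhost:9210/my-small-index/_settings' -H 'Content-Type: application/json' -d '
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "hw-sh-t-opslog-10-datanode_stale"
}'

# shrink it down to a single primary shard, clearing the temporary settings on the target
curl -s -u <user>:<password> -X POST 'http://localhost:9210/my-small-index/_shrink/my-small-index-shrunk' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}'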

It also seems like your master nodes are ingest nodes. Make sure that the dedicated master nodes are not handling traffic and not processing ingest pipelines.
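For reference, a fully dedicated master node in 7.5 would typically have the other roles switched off in elasticsearch.yml, along these lines (the node.ml line only applies to the default distribution, which you are running):

node.master: true
node.data: false
node.ingest: false
node.ml: false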

I did not see any response about how you arrived at the settings you overrode.

If your data nodes are using HDDs, I doubt this setting (together with the other overrides) is appropriate, as it might lead to a lot of disk I/O, which could interfere with persisting the updated cluster state.
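If you want to confirm whether disk I/O on the HDD nodes is a factor, watching device utilisation on a stale data node while the problem is happening can help, e.g. with iostat from the sysstat package:

# extended per-device statistics every 5 seconds; look for high %util and await on the HDDs
iostat -x 5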

The node was repeatedly leaving the cluster because something (not Elasticsearch) was causing TCP-level disconnections, as indicated by the "reason: disconnected" entries in the master's logs.

Without understanding what that "something" was, and why it was disrupting TCP connections, it's impossible to say how to avoid this in future.

Also to reiterate Christian's point, these settings are listed under the expert settings section of the docs, which comes with the following warning:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

Since you adjusted these settings, you should not be surprised that things aren't working properly any more.
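The simplest fix is to delete those fault-detection overrides from elasticsearch.yml on every node so the defaults apply; if I remember the 7.x defaults correctly, they are equivalent to:

cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3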


Thanks a lot, we'll try to optimize the shard size.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.