Elasticsearch cluster suddenly losing nodes

Hi everyone,

I have a cluster with 9 nodes: 3 master nodes and 6 data nodes.

Elasticsearch version: 7.10.2

  • Nodes are hosted on: AWS EC2 instances
  • Master node configuration: t4g.medium (2 vCPU, 4 GB RAM)
  • Data node configuration: t4g.xlarge (4 vCPU, 16 GB RAM)
  • Elasticsearch RAM allocation: Half of the instance's RAM
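
With half of the instance's RAM going to the Elasticsearch heap, jvm.options on each node type ends up along these lines:

# jvm.options on the data nodes (half of 16 GB)
-Xms8g
-Xmx8g

# jvm.options on the master nodes (half of 4 GB)
-Xms2g
-Xmx2g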

For the past two months I have been experiencing cluster crashes almost every week, sometimes two or three times a week, caused by loss of connection between the nodes.

There has been no CPU or RAM overload on either the master or the data nodes, and no unusual commands have been executed on the cluster.

I don’t believe AWS is having network issues between the nodes so frequently.

The only potential cause I can think of is having too many shards. I currently have around 32,000 active shards, while the recommended maximum is 1,000 shards per node.
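
For reference, the shard count can be read from the cluster health API:

GET _cluster/health?filter_path=status,active_shards,active_primary_shards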

Could anyone suggest other possible reasons for these incidents?

Here are the relevant logs:

[2024-08-29T10:57:11,230][WARN ][o.e.t.TransportService   ] [master-1] Received response for a request that has timed out, sent [13406ms] ago, timed out [3402ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data_1}{<node_id_1>}{<uuid_1>}{<ip_1>}{<ip_1>:9300}{d}{xpack.installed=true, transform.node=false}], id [<id_1>]
[2024-08-29T10:57:31,466][INFO ][o.e.c.s.MasterService    ] [master-1] node-left[{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false} reason: followers check retry count exceeded], term: 17939, version: 489631, delta: removed {{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false}}

[2024-08-29T10:57:38,208][INFO ][o.e.c.c.Coordinator      ] [data_0] master node [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.2.jar:7.10.2]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181)
	...
Caused by: org.elasticsearch.transport.RemoteTransportException: [master-1][<ip_3>:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false}] has been removed from the cluster
        at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:192) ~[elasticsearch-7.10.2.jar:7.10.2]
	...

[2024-08-29T10:57:38,212][INFO ][o.e.c.s.ClusterApplierService] [data_0] master node changed {previous [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}], current []}, term: 17939, version: 489630, reason: becoming candidate: onLeaderFailure

[2024-08-29T10:57:39,244][WARN ][o.e.c.s.DiagnosticTrustManager] [data_0] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=instance] and fingerprint [<fingerprint_1>]; the certificate does not have any subject alternative names; the certificate is issued by [CN=Elastic Certificate Tool Autogenerated CA]; the certificate is signed by (subject [CN=Elastic Certificate Tool Autogenerated CA] fingerprint [<fingerprint_2>]) which is self-issued; the [CN=Elastic Certificate Tool Autogenerated CA] certificate is not trusted in this ssl context ([xpack.security.transport.ssl]); this ssl context does trust a certificate with subject [CN=Elastic Certificate Tool Autogenerated CA] but the trusted certificate has fingerprint [<fingerprint_3>]
sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors

The handling of large numbers of shards has been improved a lot in recent versions, but as you are on a very old version I believe the old guidelines still apply. 1,000 shards per node is the maximum allowed by the default settings. The general recommendation is to keep the number of shards below 20 per GB of heap (this is a maximum, not a level to aim for), which in your case would mean fewer than about 1,080 shards in the cluster as a whole, if I have calculated correctly. You are therefore exceptionally oversharded, and at that level I am not surprised you are experiencing stability problems.
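
Spelling that estimate out, with heap set to half of the RAM on each node:

6 data nodes   × 8 GB heap × 20 shards/GB = 960 shards
3 master nodes × 2 GB heap × 20 shards/GB = 120 shards
Total: roughly 1,080 shards for the whole cluster

Around 32,000 active shards is therefore close to 30 times that ceiling.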

I would recommend two things:

  • Upgrade to the latest version of Elasticsearch as soon as possible
  • Dramatically reduce the number of shards in your cluster and change how you shard data (a sketch of one way to shrink existing indices follows below).
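
As a rough sketch of the second point: in 7.10 an existing index can be shrunk to fewer primary shards with the _shrink API. The index and node names below are placeholders; the source index first has to be made read-only with a copy of every shard on a single node:

PUT /my-index-000001/_settings
{
  "settings": {
    "index.number_of_replicas": 0,
    "index.routing.allocation.require._name": "data_1",
    "index.blocks.write": true
  }
}

POST /my-index-000001/_shrink/my-index-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}

The target shard count must be a factor of the source index's shard count, and shrinking alone will not take you from 32,000 shards to around 1,000, so revisiting how new indices are created (fewer, larger indices) is the bigger part of the work.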

I also see that you are running on instances with burstable CPU. This may also be contributing to the stability issues, as nodes can be starved of CPU when they exhaust their CPU credits during a period of elevated usage. This type of instance can be OK for dedicated master nodes, but I would not recommend it for data nodes.
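
If you want to confirm whether CPU credits are part of the problem, checking the CPUCreditBalance CloudWatch metric for a data node around the time of an incident is a quick test (instance ID and time range below are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-08-29T09:00:00Z \
  --end-time 2024-08-29T12:00:00Z \
  --period 300 \
  --statistics Minimum

A balance that drops to zero shortly before the fault-detection timeouts in the logs would point in that direction.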
