Hi everyone,
I have a cluster with 9 nodes: 3 master nodes and 6 data nodes.
- Elasticsearch version: 7.10.2
- Nodes are hosted on: AWS EC2 instances
- Master node configuration: t4g.medium (2 vCPU, 4 GB RAM)
- Data node configuration: t4g.xlarge (4 vCPU, 16 GB RAM)
- Elasticsearch heap allocation: half of each instance's RAM (2 GB heap on the masters, 8 GB on the data nodes)
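For reference, the heap is pinned in jvm.options on every node; a sketch of the data-node settings (the path is the default package layout and may differ per setup):

    # /etc/elasticsearch/jvm.options (data nodes): heap fixed at half of the 16 GB RAM
    -Xms8g
    -Xmx8g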
For the past two months, the cluster has crashed almost every week, sometimes two or three times a week, due to loss of connection between nodes.
There has been no CPU or RAM overload on either the master or data nodes, and no unusual commands have been run against the cluster.
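(I checked resource usage with the _cat nodes API; a minimal sketch, assuming the cluster answers on localhost:9200 and omitting any auth/TLS flags:)

    curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,load_1m'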
I find it hard to believe that AWS has network issues between the nodes this frequently.
The only potential cause I can think of is having too many shards. I currently have around 32,000 active shards across the 6 data nodes, roughly 5,300 per node, while the commonly recommended ceiling (and the default for cluster.max_shards_per_node) is 1,000 shards per node.
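The shard counts above come from the cluster health and _cat APIs; same assumptions as before (localhost:9200, auth flags omitted):

    curl -s 'localhost:9200/_cluster/health?filter_path=status,active_shards'
    curl -s 'localhost:9200/_cat/allocation?v'   # shards and disk usage per data node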
Could anyone suggest other possible reasons for these incidents?
Here are the relevant logs:
[2024-08-29T10:57:11,230][WARN ][o.e.t.TransportService ] [master-1] Received response for a request that has timed out, sent [13406ms] ago, timed out [3402ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data_1}{<node_id_1>}{<uuid_1>}{<ip_1>}{<ip_1>:9300}{d}{xpack.installed=true, transform.node=false}], id [<id_1>]
[2024-08-29T10:57:31,466][INFO ][o.e.c.s.MasterService ] [master-1] node-left[{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false} reason: followers check retry count exceeded], term: 17939, version: 489631, delta: removed {{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false}}
[2024-08-29T10:57:38,208][INFO ][o.e.c.c.Coordinator ] [data_0] master node [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}] failed [3] consecutive checks
at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.2.jar:7.10.2]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181)
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [master-1][<ip_3>:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{data_0}{<node_id_2>}{<uuid_2>}{<ip_2>}{<ip_2>:9300}{d}{xpack.installed=true, transform.node=false}] has been removed from the cluster
at org.elasticsearch.cluster.coordination.LeaderChecker.handleLeaderCheck(LeaderChecker.java:192) ~[elasticsearch-7.10.2.jar:7.10.2]
...
[2024-08-29T10:57:38,212][INFO ][o.e.c.s.ClusterApplierService] [data_0] master node changed {previous [{master-1}{<node_id_3>}{<uuid_3>}{<ip_3>}{<ip_3>:9300}{m}{xpack.installed=true, transform.node=false}], current []}, term: 17939, version: 489630, reason: becoming candidate: onLeaderFailure
[2024-08-29T10:57:39,244][WARN ][o.e.c.s.DiagnosticTrustManager] [data_0] failed to establish trust with server at [<unknown host>]; the server provided a certificate with subject name [CN=instance] and fingerprint [<fingerprint_1>]; the certificate does not have any subject alternative names; the certificate is issued by [CN=Elastic Certificate Tool Autogenerated CA]; the certificate is signed by (subject [CN=Elastic Certificate Tool Autogenerated CA] fingerprint [<fingerprint_2>]) which is self-issued; the [CN=Elastic Certificate Tool Autogenerated CA] certificate is not trusted in this ssl context ([xpack.security.transport.ssl]); this ssl context does trust a certificate with subject [CN=Elastic Certificate Tool Autogenerated CA] but the trusted certificate has fingerprint [<fingerprint_3>]
sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: Path does not chain with any of the trust anchors
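One more detail: the DiagnosticTrustManager warning above suggests data_0 trusts a CA with a different fingerprint than the one that signed master-1's certificate. To compare fingerprints across nodes, something like this should work (a sketch; the certificate path is an assumption based on a typical layout, and the digest may need switching between -sha256 and -sha1 to match the log's format):

    # list the certificates and CAs each node has loaded (X-Pack SSL certificates API)
    curl -s 'localhost:9200/_ssl/certificates'
    # print a local certificate's fingerprint for comparison with the log message
    openssl x509 -in /etc/elasticsearch/certs/instance.crt -noout -fingerprint -sha256

Could this certificate mismatch be related to the disconnects, or is it a red herring?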