I'm running an Elasticsearch cluster (7.4.0) with 3 master nodes, 3 data nodes, and 1 coordinating node.
After running smoothly for months, nodes began continually disconnecting from and reconnecting to the cluster. From what I've observed, one particular master node and two particular data nodes seem to be the culprits, but I'm at a loss as to how to tackle this.
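For reference, here's roughly how I've been watching cluster membership and pending tasks while this happens (a minimal sketch using the standard _cat/_cluster endpoints; localhost:9200 just stands in for whichever node still responds):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,node.role,master,heap.percent,cpu,load_1m'
curl -s 'http://localhost:9200/_cat/pending_tasks?v'
curl -s 'http://localhost:9200/_cluster/health?pretty'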
Some logs from the master node that keeps disconnecting:
[2020-08-04T10:35:10,908][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [elk00-master] collector [cluster_stats] timed out when collecting data
[2020-08-04T10:35:12,471][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk00-master] failed to execute on node [hF7l7lJqToywM0Ol2X1yMw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk05-data][10.254.113.18:9300][cluster:monitor/nodes/stats[n]] request_id [602470] timed out after [15012ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T10:35:12,814][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk00-master] failed to execute on node [hF7l7lJqToywM0Ol2X1yMw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk05-data][10.254.113.18:9300][cluster:monitor/nodes/stats[n]] request_id [602489] timed out after [15012ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T10:35:27,492][INFO ][o.e.c.s.ClusterApplierService] [elk00-master] added {{elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, ml.max_open_jobs=20, xpack.installed=true},}, term: 1527, version: 64772, reason: Publication{term=1527, version=64772}
[2020-08-04T10:35:27,498][WARN ][o.e.c.c.C.CoordinatorPublication] [elk00-master] after [30s] publication of cluster state version [64772] is still waiting for {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT], {elk05-data}{hF7l7lJqToywM0Ol2X1yMw}{3k5YYeu3S0-4mVJwbuK5yA}{10.254.113.18}{10.254.113.18:9300}{dil}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} [SENT_APPLY_COMMIT]
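The nodes-stats timeouts above all point at elk05-data, so I've also been trying to see whether that node is overloaded or stuck in GC, roughly like this (hot_threads is a standard API; the log path is the default package location and may differ on your install):

curl -s 'http://localhost:9200/_nodes/elk05-data/hot_threads?threads=5'
# log path assumed; adjust to wherever your Elasticsearch logs live
grep -iE 'gc overhead|out of memory' /var/log/elasticsearch/*.log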
Some logs from the data node that keeps disconnecting:
[2020-08-04T11:09:17,483][INFO ][o.e.c.c.JoinHelper ] [elk04-data] failed to join {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=1527, lastAcceptedTerm=1522, lastAcceptedVersion=62819, sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, targetNode={elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk00-master][10.254.113.51:9300][internal:cluster/coordination/join] request_id [11269] timed out after [59847ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T11:09:27,485][INFO ][o.e.c.c.JoinHelper ] [elk04-data] last failed join attempt was 10s ago, failed to join {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=1527, lastAcceptedTerm=1522, lastAcceptedVersion=62819, sourceNode={elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, targetNode={elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk00-master][10.254.113.51:9300][internal:cluster/coordination/join] request_id [11269] timed out after [59847ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1022) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-08-04T11:09:47,490][WARN ][o.e.c.c.ClusterFormationFailureHelper] [elk04-data] master not discovered yet: have discovered [{elk04-data}{6kbmixx4T5e9DlvaYgZpqA}{bXRbtweXSxqJi4M1TeDdxA}{10.254.113.40}{10.254.113.40:9300}{dil}{ml.machine_memory=33565130752, xpack.installed=true, ml.max_open_jobs=20}, {elk02-master}{zllmwxZVSLqkmm6dRtu0Bw}{IA6y_BnoTuyilEEBlcZnlg}{10.254.113.28}{10.254.113.28:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.254.113.51:9300, 10.254.113.45:9300, 10.254.113.28:9300] from hosts providers and [{elk02-master}{zllmwxZVSLqkmm6dRtu0Bw}{IA6y_BnoTuyilEEBlcZnlg}{10.254.113.28}{10.254.113.28:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk00-master}{VPsRjHluQPq32otzybp0kg}{aYewTJWiQgW_OPiBSfvWJA}{10.254.113.51}{10.254.113.51:9300}{lm}{ml.machine_memory=33565298688, ml.max_open_jobs=20, xpack.installed=true}, {elk01-master}{A6CKDkBgTEmSRofFg7COjQ}{yxZzjWolTGWGVZpfID_e6A}{10.254.113.45}{10.254.113.45:9300}{lm}{ml.machine_memory=33565290496, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 1527, last-accepted version 65165 in term 1527
[2020-08-04T11:09:51,591][ERROR][o.e.x.m.c.n.NodeStatsCollector] [elk04-data] collector [node_stats] timed out when collecting data
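Since elk04-data can discover the masters but still times out joining, I also sanity-checked transport connectivity from that node to each master on port 9300 (plain netcat; the IPs are taken from the logs above):

nc -zv 10.254.113.51 9300
nc -zv 10.254.113.45 9300
nc -zv 10.254.113.28 9300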
Output of _cat/health?v&ts=false&pretty:
cluster  status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
T2-OSS02 red             5         2    158 158    0    6      882            52              49.1m                 15.1%
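To dig into the 882 unassigned shards I've been using the allocation explain API (with no request body it explains an arbitrary unassigned shard; the index name in the second example is just a placeholder):

curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'
# or for a specific shard (index name is a placeholder):
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/_cluster/allocation/explain?pretty' -d '{"index":"my-index","shard":0,"primary":true}'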
I believe the cause was a disk-related issue on the RHEL servers hosting these nodes, which resulted in some data loss on their end. The team managing those servers says everything is stable again, but I've tried restarting the cluster with no luck.
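In case it's relevant, this is roughly what I've been checking since the servers came back, to see per-node disk usage and whether any recoveries are actually progressing (standard _cat endpoints again):

curl -s 'http://localhost:9200/_cat/allocation?v'
curl -s 'http://localhost:9200/_cat/recovery?v&active_only=true'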
Please let me know if you have any input!
Thank you