Elasticsearch 7.15.1 / Unhappy Cluster

I am trying to troubleshoot an Elasticsearch 7.15.1 cluster that has 3 master nodes and 30+ data nodes. Recently the data nodes have started complaining about "master not discovered yet" after they have been running a while. When we restart the data nodes they run fine for a time, then fall back into this state.

On the master nodes we see messages like this:

[2023-01-11T21:35:41,500][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814273, reason: ApplyCommitRequest{term=244, version=2814273, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:46,339][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]removed {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814274, reason: ApplyCommitRequest{term=244, version=2814274, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:48,490][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814276, reason: ApplyCommitRequest{term=244, version=2814276, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:53,744][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]removed {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814277, reason: ApplyCommitRequest{term=244, version=2814277, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:56,493][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}, {ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}}, term: 244, version: 2814279, reason: ApplyCommitRequest{term=244, version=2814279, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}

On the data nodes we see messages like this as well:

[2023-01-11T21:35:20,836][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:30,899][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:40,965][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:51,032][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:53,987][INFO ][o.e.c.c.JoinHelper       ] [ess17-3]failed to join {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}{box-type=warm, xpack.installed=true, transform.node=false, host=ess17, gateway=true}, minimumTerm=244, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
	at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:509) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1289) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1151) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ess17-3][10.X.X6.250:9303][internal:cluster/coordination/join/validate] disconnected
[2023-01-11T21:35:53,994][INFO ][o.e.c.c.JoinHelper       ] [ess17-3]failed to join {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}{box-type=warm, xpack.installed=true, transform.node=false, host=ess17, gateway=true}, minimumTerm=244, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
	at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:509) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1289) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1151) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ess17-3][10.X.X6.250:9303][internal:cluster/coordination/join/validate] disconnected
[2023-01-11T21:35:56,977][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]master node changed {previous [], current [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}]}, term: 244, version: 2814279, reason: ApplyCommitRequest{term=244, version=2814279, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:36:00,882][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]removed {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814280, reason: ApplyCommitRequest{term=244, version=2814280, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:36:04,364][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]added {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814283, reason: ApplyCommitRequest{term=244, version=2814283, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}

At times we also see messages like this on the data nodes:

[2023-01-11T21:36:21,281][WARN ][o.e.c.a.s.ShardStateAction] [ess17-3]unexpected failure while sending request [internal:cluster/shard/failure] to [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[.ds-myapp2-logs-2023.01.09-000001][7]], allocation id [CfcA8xrgTEy8fcKc0wCK5Q], primary term [11], message [failed to perform indices:data/write/bulk[s] on replica [.ds-myapp2-logs-2023.01.09-000001][7], node[CP8JKRRHRI2WOdjU6RFDzw], [R], s[STARTED], a[id=CfcA8xrgTEy8fcKc0wCK5Q]], failure [RemoteTransportException[[ess09-3][10.X.X6.158:9303][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[.ds-myapp2-logs-2023.01.09-000001][7] operation primary term [11] is too old (current [12])]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [11] did not match current primary term [12]
	at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:362) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:706) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:328) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:223) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:63) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:155) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222) ~[elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2023-01-11T21:36:21,282][WARN ][o.e.c.a.s.ShardStateAction] [ess17-3]unexpected failure while sending request [internal:cluster/shard/failure] to [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[.ds-myapp-logs-2023.01.09-000001][14]], allocation id [GeIVxR76TQOB0t_-iIRxNQ], primary term [12], message [failed to perform indices:data/write/bulk[s] on replica [.ds-myapp-logs-2023.01.09-000001][14], node[MlAaPdsBTNitGyNREgP1Vg], [R], s[STARTED], a[id=GeIVxR76TQOB0t_-iIRxNQ]], failure [RemoteTransportException[[ess01-3][10.X.X6.73:9303][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[.ds-myapp-logs-2023.01.09-000001][14] operation primary term [12] is too old (current [13])]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [12] did not match current primary term [13]
	at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:362) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:706) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:328) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:223) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:63) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:155) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259) ~[elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222) ~[elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

I am trying to determine whether this smells like a networking issue (master nodes not responding in time, or DNS taking too long), or whether the masters are simply overloaded and can't handle requests fast enough.

What is interesting is that the cluster generally sits at around 99% shard allocation; then a node will freak out and allocation drops back to around 90%. It never seems to reach 100%, especially during the day.

We have fully restarted the processes on all 3 masters, and that hasn't fixed the issue. We have also, in some cases, restarted the Elasticsearch process on instances that get stuck trying to find the master.
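If it helps, while an episode is happening the cluster state can be watched with calls along these lines (console syntax; a sketch, exact fields vary by version):

```
# Which node the cluster currently believes is the elected master
GET _cat/master?v

# Overall health, plus how many cluster-state updates are queued on the master
GET _cluster/health?pretty
GET _cat/pending_tasks?v
```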

Welcome to our community! :smiley:

What is the output from the _cluster/stats?pretty&human API?

Output is this:

{
  "_nodes" : {
    "total" : 51,
    "successful" : 51,
    "failed" : 0
  },
  "cluster_name" : "es-cluster",
  "cluster_uuid" : "kLilcwqJQEqyIb5yzzhjzQ",
  "timestamp" : 1673476774008,
  "status" : "red",
  "indices" : {
    "count" : 10758,
    "shards" : {
      "total" : 29599,
      "primaries" : 14809,
      "replication" : 0.9987169964210952,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 64,
          "avg" : 2.7513478341699202
        },
        "primaries" : {
          "min" : 1,
          "max" : 32,
          "avg" : 1.3765569808514593
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9994294542633869
        }
      }
    },
    "docs" : {
      "count" : 236054669015,
      "deleted" : 8537307
    },
    "store" : {
      "size_in_bytes" : 219975237928283,
      "total_data_set_size_in_bytes" : 219975237928283,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 302503320,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 270881299,
      "total_count" : 2600750,
      "hit_count" : 74742,
      "miss_count" : 2526008,
      "cache_size" : 30678,
      "cache_count" : 92952,
      "evictions" : 62274
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 160925,
      "memory_in_bytes" : 1480678468,
      "terms_memory_in_bytes" : 1159501672,
      "stored_fields_memory_in_bytes" : 102279336,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 9480960,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 209416500,
      "index_writer_memory_in_bytes" : 10949674716,
      "version_map_memory_in_bytes" : 16533535,
      "fixed_bit_set_memory_in_bytes" : 82518584,
      "max_unsafe_auto_id_timestamp" : 1673475773698,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 10,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 830,
          "index_count" : 282,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 42511,
          "index_count" : 10754,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 30,
          "index_count" : 30,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 13,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 1590,
          "index_count" : 860,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 8,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 56,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 1954,
          "index_count" : 525,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 8,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 320145,
          "index_count" : 10754,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 31377,
          "index_count" : 10521,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 39,
          "index_count" : 17,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 24007,
          "index_count" : 1771,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 15888,
          "index_count" : 9684,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 4,
          "index_count" : 4,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "6.8.3",
        "index_count" : 2,
        "primary_shard_count" : 2,
        "total_primary_bytes" : 43243
      },
      {
        "version" : "7.8.1",
        "index_count" : 12,
        "primary_shard_count" : 44,
        "total_primary_bytes" : 21419562435
      },
      {
        "version" : "7.15.1",
        "index_count" : 10745,
        "primary_shard_count" : 14766,
        "total_primary_bytes" : 110125718913080
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 51,
      "coordinating_only" : 1,
      "data" : 0,
      "data_cold" : 0,
      "data_content" : 29,
      "data_frozen" : 0,
      "data_hot" : 18,
      "data_warm" : 29,
      "ingest" : 29,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.15.1"
    ],
    "os" : {
      "available_processors" : 3408,
      "allocated_processors" : 1201,
      "names" : [
        {
          "name" : "Linux",
          "count" : 51
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 51
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 51
        }
      ],
      "mem" : {
        "total_in_bytes" : 10941438787584,
        "free_in_bytes" : 1016357142528,
        "used_in_bytes" : 9925081645056,
        "free_percent" : 9,
        "used_percent" : 91
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 119
      },
      "open_file_descriptors" : {
        "min" : 1495,
        "max" : 6977,
        "avg" : 4579
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 35409108762,
      "versions" : [
        {
          "version" : "11.0.6",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "11.0.6+10-LTS",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 47
        },
        {
          "version" : "11.0.12",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "11.0.12+7-LTS",
          "vm_vendor" : "Red Hat, Inc.",
          "bundled_jdk" : true,
          "using_bundled_jdk" : false,
          "count" : 4
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 644166227048,
        "heap_max_in_bytes" : 1537009516544
      },
      "threads" : 9840
    },
    "fs" : {
      "total_in_bytes" : 989546098245632,
      "free_in_bytes" : 768544304394240,
      "available_in_bytes" : 718833401200640
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 51
      },
      "http_types" : {
        "netty4" : 51
      }
    },
    "discovery_types" : {
      "zen" : 51
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 51
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 11,
      "processor_stats" : {
        "conditional" : {
          "count" : 16253,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 773
        },
        "geoip" : {
          "count" : 15360,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 233
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "pipeline" : {
          "count" : 61440,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 1464
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 15360,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 249
        }
      }
    }
  }
}

Can you clarify how many data nodes you actually have; it looks like 47?

If so, you're probably pushing your heap a bit too hard given the number of shards relative to the actual data size you have. At first glance this looks like a typical case of oversharding creating heap pressure and causing nodes to drop out.
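As a rough check (console syntax; a sketch, not a full diagnosis), comparing per-node heap usage and shard counts will usually make the pressure visible:

```
# Heap usage and roles per node - data nodes sitting persistently in the
# high 80s/90s heap.percent are a red flag
GET _cat/nodes?v&h=name,node.role,heap.percent,heap.max,ram.percent

# Shard distribution per node - your stats show ~29,600 shards across ~47
# data nodes, i.e. 600+ shards per node, at or above the old rule of thumb
# of roughly 20 shards per GB of heap
GET _cat/allocation?v
```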

Yes, I believe there are 48 nodes in total, so I'm sure we are pushing heap limits; I know our sharding is a bit out of control. We are working to bring that way down. In the meantime, do you have any suggestions on how best to get the cluster into a happier state?

Any suggestions on this? I know there are settings to manually increase the queue, but I'm not sure what the recommendation is here, or what would be best:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$3@60f092ec on EsThreadPoolExecutor[name = ess30-3/write, queue capacity = 1000,

[2023-01-12T17:09:58,974][WARN ][o.e.x.m.MonitoringService] [ess30-3]monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$0(ExportBulk.java:110) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$1(LocalBulk.java:113) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:142) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:38) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:92) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListener$RunBeforeActionListener.onFailure(ActionListener.java:397) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.lambda$processBulkIndexIngestRequest$4(TransportBulkAction.java:693) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.ingest.IngestService$3.onFailure(IngestService.java:454) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:52) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:730) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:450) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:686) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.doInternalExecute(TransportBulkAction.java:206) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:162) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:89) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:173) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:42) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:171) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:149) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:77) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:90) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:70) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:402) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.client.support.AbstractClient.bulk(AbstractClient.java:484) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.core.ClientHelper.executeAsyncWithOrigin(ClientHelper.java:119) [x-pack-core-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:106) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:64) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$doFlush$1(ExportBulk.java:108) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.core.common.IteratingActionListener.run(IteratingActionListener.java:103) [x-pack-core-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:124) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:64) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.doExport(Exporters.java:284) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.lambda$export$3(Exporters.java:259) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134) [elasticsearch-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.delegateIfComplete(Exporters.java:364) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.onResponse(Exporters.java:343) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters$AccumulatingExportBulkActionListener.onResponse(Exporters.java:314) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalExporter.openBulk(LocalExporter.java:243) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.wrapExportBulk(Exporters.java:245) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:257) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.xpack.monitoring.MonitoringService$MonitoringExecution$1.doRun(MonitoringService.java:262) [x-pack-monitoring-7.15.1.jar:7.15.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) [elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
	... 49 more
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$3@60f092ec on EsThreadPoolExecutor[name = ess30-3/write, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4ab5821b[Running, pool size = 21, active threads = 21, queued tasks = 1105, completed tasks = 1400066]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:37) ~[elasticsearch-7.15.1.jar:7.15.1]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:73) ~[elasticsearch-7.15.1.jar:7.15.1]
	... 38 more
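
The root cause at the bottom of that trace is the interesting part: the `write` thread pool on `ess30-3` has all 21 threads active and over 1000 tasks queued (queue capacity = 1000, queued tasks = 1105), so new bulk/ingest work is being rejected. You can watch for this across the cluster with the `_cat/thread_pool` API, something like:

```
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed&s=rejected:desc
```

Steadily climbing `rejected` counts on particular nodes would suggest those nodes are overloaded (too many shards, too much indexing traffic, or slow disks), which can also starve the transport layer enough to cause the master-discovery flapping you're seeing.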

Ideally, look at using reindex and shrink to reduce your shard count in the short term.
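
For reference, a shrink roughly looks like this (index names here are placeholders, and the source index's shard count must be a multiple of the target's): first block writes and relocate a copy of every shard onto a single node, then run the shrink.

```
PUT /my-index/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "ess30-3"
}

POST /my-index/_shrink/my-index-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}
```

Once the shrunken index is green, you can point aliases at it and delete the original.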

I wouldn't do that right now; it's only going to put more pressure on your nodes.