I am trying to troubleshoot an Elasticsearch 7.15.1 cluster that has 3 master nodes and 30+ data nodes. Recently we have been seeing data nodes complain about "master not discovered yet" after they have been running for a while. When we start the data nodes they run fine for a while and then start misbehaving after some time.
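To put a rough number on "a while" and on how often the master view flaps, I am planning to watch the elected master with a small poller along these lines (a minimal sketch, not production tooling; the HTTP endpoint below is a placeholder for one of our nodes):

```python
# Minimal sketch: poll the cluster every few seconds and log whenever the node
# reported as elected master changes, so I can timestamp when the flapping
# starts. The HTTP endpoint is a placeholder; real hosts are anonymized.
import json
import time
import urllib.request

ES = "http://10.0.0.1:9200"  # placeholder HTTP endpoint

def get_json(path):
    with urllib.request.urlopen(ES + path, timeout=10) as resp:
        return json.loads(resp.read())

last_master = None
while True:
    try:
        state = get_json("/_cluster/state/master_node,nodes")
        master_id = state.get("master_node")
        master = state["nodes"].get(master_id, {}).get("name") if master_id else None
        health = get_json("/_cluster/health")
        if master != last_master:
            print(time.strftime("%Y-%m-%d %H:%M:%S"),
                  "master:", master,
                  "status:", health["status"],
                  "active_shards_percent:", health["active_shards_percent_as_number"])
            last_master = master
    except Exception as exc:  # node briefly unreachable, timeouts, etc.
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "request failed:", exc)
    time.sleep(5)
```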
On the master nodes we see messages like this:
[2023-01-11T21:35:41,500][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814273, reason: ApplyCommitRequest{term=244, version=2814273, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:46,339][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]removed {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814274, reason: ApplyCommitRequest{term=244, version=2814274, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:48,490][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814276, reason: ApplyCommitRequest{term=244, version=2814276, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:53,744][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]removed {{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814277, reason: ApplyCommitRequest{term=244, version=2814277, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:35:56,493][INFO ][o.e.c.s.ClusterApplierService] [esm02-master]added {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}, {ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}}, term: 244, version: 2814279, reason: ApplyCommitRequest{term=244, version=2814279, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
On the data nodes we see messages like this as well:
[2023-01-11T21:35:20,836][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:30,899][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:40,965][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:51,032][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ess17-3]master not discovered yet: have discovered [{ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}, {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}]; discovery will continue using [10.X.X6.55:9303, 10.X.X6.19:9303, 10.X.X6.56:9303] from hosts providers and [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}, {esm02-master}{Try9HeLLRgKVyjvd0yoZ2g}{NXS1t1hiQOiNF5EC9lJB8A}{10.X.X6.19}{10.X.X6.19:9303}{m}, {esm01-master}{C78EUk7_Tk-cHlNZUviuSg}{dblN4N_jTkGK-3m1QhuJ5A}{10.X.X6.55}{10.X.X6.55:9303}{m}] from last-known cluster state; node term 244, last-accepted version 2814228 in term 244
[2023-01-11T21:35:53,987][INFO ][o.e.c.c.JoinHelper ] [ess17-3]failed to join {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}{box-type=warm, xpack.installed=true, transform.node=false, host=ess17, gateway=true}, minimumTerm=244, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:509) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1289) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1151) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ess17-3][10.X.X6.250:9303][internal:cluster/coordination/join/validate] disconnected
[2023-01-11T21:35:53,994][INFO ][o.e.c.c.JoinHelper ] [ess17-3]failed to join {esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false} with JoinRequest{sourceNode={ess17-3}{9CKKdeNyTyiCNht2r9GS5A}{TEWP45M7TCeunwXh7D1bkA}{10.X.X6.250}{10.X.X6.250:9303}{isw}{box-type=warm, xpack.installed=true, transform.node=false, host=ess17, gateway=true}, minimumTerm=244, optionalJoin=Optional.empty}
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:509) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1289) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1151) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [ess17-3][10.X.X6.250:9303][internal:cluster/coordination/join/validate] disconnected
[2023-01-11T21:35:56,977][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]master node changed {previous [], current [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}]}, term: 244, version: 2814279, reason: ApplyCommitRequest{term=244, version=2814279, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:36:00,882][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]removed {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814280, reason: ApplyCommitRequest{term=244, version=2814280, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
[2023-01-11T21:36:04,364][INFO ][o.e.c.s.ClusterApplierService] [ess17-3]added {{esh18-3}{Mm2c5CQZTZqim8VHGycoxQ}{6jbH9KriS3Sm9waH-wByCA}{10.X.X7.252}{10.X.X7.252:9303}{h}}, term: 244, version: 2814283, reason: ApplyCommitRequest{term=244, version=2814283, sourceNode={esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}}
At times we also get messages like this on the data nodes:
[2023-01-11T21:36:21,281][WARN ][o.e.c.a.s.ShardStateAction] [ess17-3]unexpected failure while sending request [internal:cluster/shard/failure] to [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[.ds-myapp2-logs-2023.01.09-000001][7]], allocation id [CfcA8xrgTEy8fcKc0wCK5Q], primary term [11], message [failed to perform indices:data/write/bulk[s] on replica [.ds-myapp2-logs-2023.01.09-000001][7], node[CP8JKRRHRI2WOdjU6RFDzw], [R], s[STARTED], a[id=CfcA8xrgTEy8fcKc0wCK5Q]], failure [RemoteTransportException[[ess09-3][10.X.X6.158:9303][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[.ds-myapp2-logs-2023.01.09-000001][7] operation primary term [11] is too old (current [12])]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [11] did not match current primary term [12]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:362) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:706) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:328) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:223) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:63) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:155) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222) ~[elasticsearch-7.15.1.jar:7.15.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
[2023-01-11T21:36:21,282][WARN ][o.e.c.a.s.ShardStateAction] [ess17-3]unexpected failure while sending request [internal:cluster/shard/failure] to [{esm03-master}{8m7sut3QRyOypTs5tfNA5g}{sPJgwhWXR7m-hDbaeCNwTQ}{10.X.X6.56}{10.X.X6.56:9303}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[.ds-myapp-logs-2023.01.09-000001][14]], allocation id [GeIVxR76TQOB0t_-iIRxNQ], primary term [12], message [failed to perform indices:data/write/bulk[s] on replica [.ds-myapp-logs-2023.01.09-000001][14], node[MlAaPdsBTNitGyNREgP1Vg], [R], s[STARTED], a[id=GeIVxR76TQOB0t_-iIRxNQ]], failure [RemoteTransportException[[ess01-3][10.X.X6.73:9303][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[.ds-myapp-logs-2023.01.09-000001][14] operation primary term [12] is too old (current [13])]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [esm03-master][10.X.X6.56:9303][internal:cluster/shard/failure]
Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [12] did not match current primary term [13]
at org.elasticsearch.cluster.action.shard.ShardStateAction$ShardFailedClusterStateTaskExecutor.execute(ShardStateAction.java:362) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:706) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:328) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:223) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:63) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:155) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:678) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:259) ~[elasticsearch-7.15.1.jar:7.15.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:222) ~[elasticsearch-7.15.1.jar:7.15.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
I am trying to determine whether this smells like a networking issue (the master nodes not responding in time, or DNS lookups taking too long), or whether the masters are simply overloaded and can't handle the requests fast enough.
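Would the right way to tell these apart be to look at pending cluster-state tasks and thread pool queues on the elected master while the flapping is happening? Something roughly like this is what I had in mind (again only a sketch, with a placeholder endpoint):

```python
# Rough sketch of the "is the master overloaded" check I have in mind:
# pending cluster-state tasks plus per-node thread pool queues/rejections.
# The endpoint is a placeholder; this is an illustration, not tooling we run.
import json
import urllib.request

ES = "http://10.0.0.1:9200"  # placeholder HTTP endpoint

def get_json(path):
    with urllib.request.urlopen(ES + path, timeout=10) as resp:
        return json.loads(resp.read())

pending = get_json("/_cluster/pending_tasks")["tasks"]
print("pending cluster-state tasks:", len(pending))
for task in pending[:5]:
    print("  ", task["priority"], task["time_in_queue"], task["source"])

# Queue sizes and rejections per node (I would look at the three masters here).
url = ES + "/_cat/thread_pool/generic,management?v&h=node_name,name,active,queue,rejected"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.read().decode())
```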
What is interesting is that the cluster generally sits at around 99% shard allocation; then a node will drop out, allocation falls back to around 90%, and it never seems to reach 100%, especially during the day.
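By allocation I mean the active shards percentage from the cluster health API. When it is stuck below 100% I was going to ask the allocation explain API why, roughly like this (a quick sketch with a placeholder endpoint):

```python
# Sketch: with no request body, the allocation explain API picks an arbitrary
# unassigned shard and explains why it is not allocated (it returns a 400 if
# nothing is unassigned). The endpoint is a placeholder.
import json
import urllib.request

ES = "http://10.0.0.1:9200"  # placeholder HTTP endpoint

with urllib.request.urlopen(ES + "/_cluster/allocation/explain", timeout=10) as resp:
    explain = json.loads(resp.read())

print("shard:", explain.get("index"), explain.get("shard"), "primary:", explain.get("primary"))
print("unassigned reason:", explain.get("unassigned_info", {}).get("reason"))
print("explanation:", explain.get("allocate_explanation") or explain.get("can_allocate"))
```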
We have fully restarted the Elasticsearch processes on all 3 masters and that hasn't fixed the issue. In some cases we have also restarted the Elasticsearch process on the data nodes that get stuck trying to find the master.