Hi,
I have a 51-node Elasticsearch cluster with 48 data nodes and 3 master nodes, running ES 7.1.1. In one incident, I saw 4 data nodes stay out of the cluster for a really long time, starting right after a master switch. They would join and then get disconnected within the next 3 seconds. Interestingly, all 4 nodes get disconnected at almost the same time. After that, the cycle repeats every 10 seconds.
I have ruled out hardware and network issues for these nodes. They were fine under the previous master, and when the master switched, the other 47 nodes were able to join the cluster. Once I restarted the affected nodes and the master, the issue corrected itself, so I suspect a software bug rather than a hardware issue. I only had INFO logging enabled, so I don't have much to go on, and unfortunately there are no dumps either. Any help would be highly appreciated.
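For the next occurrence I plan to turn up coordination and transport logging dynamically so there is more to go on. A minimal sketch of what I have in mind (assuming the cluster is reachable at http://localhost:9200 without auth; adjust host, port, and credentials for your setup):

```python
import requests

# Assumed endpoint; adjust host/port and authentication for your cluster.
ES = "http://localhost:9200"

# Dynamic logger settings so node-join/node-left and disconnect reasons are
# logged in detail instead of only at INFO.
settings = {
    "transient": {
        "logger.org.elasticsearch.cluster.coordination": "DEBUG",
        "logger.org.elasticsearch.transport": "TRACE",
    }
}

resp = requests.put(f"{ES}/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```

Setting those loggers back to null afterwards returns them to the defaults once enough has been captured.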
Logs on master node:
[2020-08-18T22:47:11,357][INFO ][o.e.c.s.MasterService ] [b443e65c6dcfdab8ca3383d9e6fb6267] node-join[{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader], term: 20345, version: 2310375, reason: added {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}
[2020-08-18T22:47:12,664][INFO ][o.e.c.s.ClusterApplierService] [b443e65c6dcfdab8ca3383d9e6fb6267] added {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310375, reason: Publication{term=20345, version=2310375}
[2020-08-18T22:47:14,974][INFO ][o.e.c.s.MasterService ] [b443e65c6dcfdab8ca3383d9e6fb6267] node-left[{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected], term: 20345, version: 2310376, reason: removed {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}
[2020-08-18T22:47:15,102][INFO ][o.e.c.s.ClusterApplierService] [b443e65c6dcfdab8ca3383d9e6fb6267] removed {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310376, reason: Publication{term=20345, version=2310376}
[2020-08-18T22:47:17,876][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [b443e65c6dcfdab8ca3383d9e6fb6267] [X.X.X.X][0]: failed to list shard for shard_started on node [UiD7LgTyQcy3kzTAfKG7RQ]
org.elasticsearch.action.FailedNodeException: Failed node [UiD7LgTyQcy3kzTAfKG7RQ]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:223) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:142) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:198) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:182) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:82) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:51) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:146) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:144) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:122) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:65) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.list(TransportNodesListGatewayStartedShards.java:91) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.AsyncShardFetch.asyncFetch(AsyncShardFetch.java:283) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.AsyncShardFetch.fetchData(AsyncShardFetch.java:126) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.GatewayAllocator$InternalPrimaryShardAllocator.fetchData(GatewayAllocator.java:159) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.PrimaryShardAllocator.makeAllocationDecision(PrimaryShardAllocator.java:86) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.BaseGatewayShardAllocator.allocateUnassigned(BaseGatewayShardAllocator.java:59) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.GatewayAllocator.innerAllocatedUnassigned(GatewayAllocator.java:114) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:104) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:410) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:378) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:361) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:155) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:118) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.1.1.jar:7.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [5d211c055f1cee31c120d23f95c958b2][X.X.X.X:9300] Node not connected
at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:558) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:530) ~[elasticsearch-7.1.1.jar:7.1.1]
... 33 more
Logs on non-faulty data nodes:
[2020-08-18T22:47:11,579][INFO ][o.e.c.s.ClusterApplierService] [e60b1892bf32ee58bd969a3489c8d902] added {{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310375, reason: ApplyCommitRequest{term=20345, version=2310375, sourceNode={b443e65c6dcfdab8ca3383d9e6fb6267}{kB4q0qa0S8ukCyfYCKBUKA}{cJvK-jXFSl2e6wds8Egh8Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X}
[2020-08-18T22:47:14,995][INFO ][o.e.c.s.ClusterApplierService] [e60b1892bf32ee58bd969a3489c8d902] removed {{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310376, reason: ApplyCommitRequest{term=20345, version=2310376, sourceNode={b443e65c6dcfdab8ca3383d9e6fb6267}{kB4q0qa0S8ukCyfYCKBUKA}{cJvK-jXFSl2e6wds8Egh8Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X}
On the faulty data nodes, we only have cluster formation helper logs and nothing else; even join failure logs don't appear there. No cluster state seems to have been published to them, yet we also don't see any join failures or publication failures around that time.
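If this happens again, I also plan to capture some point-in-time diagnostics while the flapping is in progress, since we have no dumps from this incident. A rough sketch of what I'd collect (same assumption about the endpoint; hot threads as a stand-in for proper thread dumps):

```python
import time
import requests

# Assumed endpoint; adjust host/port and authentication for your cluster.
ES = "http://localhost:9200"

# Poll the node list to timestamp the ~10-second join/leave cycle, and grab
# hot threads from all nodes so there is at least something dump-like to review.
for _ in range(20):
    nodes = requests.get(f"{ES}/_cat/nodes?v&h=name,ip,node.role,master").text
    hot_threads = requests.get(f"{ES}/_nodes/hot_threads?threads=5").text
    stamp = time.strftime("%Y%m%dT%H%M%S")
    with open(f"diag-{stamp}.log", "w") as f:
        f.write(nodes + "\n" + hot_threads)
    time.sleep(5)
```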