Nodes continuously leaving and rejoining a 7.1 cluster after master switch

Hi,

I have a 51-node Elasticsearch cluster with 48 data nodes and 3 master nodes, on ES version 7.1.1. In one case, I saw that 4 nodes were continuously out of the cluster for a really long time just after a master switch. They joined and then got disconnected within the next 3 seconds. Interestingly, all 4 nodes got disconnected at almost the same time. After this, the cycle repeated every 10 seconds.

I have ruled out hardware and network issues for these nodes. They were fine while the old master was in charge, and when the master switched, the other 47 nodes were able to join the new master. Once I restarted the affected nodes and the master, the issue self-corrected. Hence, I suspect a software bug rather than a hardware issue. I had only INFO logs enabled and hence don't have much to go on here; unfortunately, there are no dumps either. Any help would be highly appreciated.
Logs on master node:

[2020-08-18T22:47:11,357][INFO ][o.e.c.s.MasterService    ] [b443e65c6dcfdab8ca3383d9e6fb6267] node-join[{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader, {265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X join existing leader], term: 20345, version: 2310375, reason: added {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}
[2020-08-18T22:47:12,664][INFO ][o.e.c.s.ClusterApplierService] [b443e65c6dcfdab8ca3383d9e6fb6267] added {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310375, reason: Publication{term=20345, version=2310375}
[2020-08-18T22:47:14,974][INFO ][o.e.c.s.MasterService    ] [b443e65c6dcfdab8ca3383d9e6fb6267] node-left[{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected, {eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X disconnected], term: 20345, version: 2310376, reason: removed {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}
[2020-08-18T22:47:15,102][INFO ][o.e.c.s.ClusterApplierService] [b443e65c6dcfdab8ca3383d9e6fb6267] removed {{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310376, reason: Publication{term=20345, version=2310376}
[2020-08-18T22:47:17,876][WARN ][o.e.g.G.InternalPrimaryShardAllocator] [b443e65c6dcfdab8ca3383d9e6fb6267] [X.X.X.X][0]: failed to list shard for shard_started on node [UiD7LgTyQcy3kzTAfKG7RQ]
org.elasticsearch.action.FailedNodeException: Failed node [UiD7LgTyQcy3kzTAfKG7RQ]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:223) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$100(TransportNodesAction.java:142) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:198) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:534) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:182) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:82) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:51) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:146) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:144) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:122) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:65) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.list(TransportNodesListGatewayStartedShards.java:91) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.AsyncShardFetch.asyncFetch(AsyncShardFetch.java:283) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.AsyncShardFetch.fetchData(AsyncShardFetch.java:126) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.GatewayAllocator$InternalPrimaryShardAllocator.fetchData(GatewayAllocator.java:159) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.PrimaryShardAllocator.makeAllocationDecision(PrimaryShardAllocator.java:86) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.BaseGatewayShardAllocator.allocateUnassigned(BaseGatewayShardAllocator.java:59) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.GatewayAllocator.innerAllocatedUnassigned(GatewayAllocator.java:114) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.gateway.GatewayAllocator.allocateUnassigned(GatewayAllocator.java:104) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:410) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:378) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:361) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:155) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.coordination.JoinHelper$1.execute(JoinHelper.java:118) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:687) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:310) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:210) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:142) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:690) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.1.1.jar:7.1.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]                              
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [5d211c055f1cee31c120d23f95c958b2][X.X.X.X:9300] Node not connected
    at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:558) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:530) ~[elasticsearch-7.1.1.jar:7.1.1]
    ... 33 more

Logs on non-faulty data nodes:

[2020-08-18T22:47:11,579][INFO ][o.e.c.s.ClusterApplierService] [e60b1892bf32ee58bd969a3489c8d902] added {{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310375, reason: ApplyCommitRequest{term=20345, version=2310375, sourceNode={b443e65c6dcfdab8ca3383d9e6fb6267}{kB4q0qa0S8ukCyfYCKBUKA}{cJvK-jXFSl2e6wds8Egh8Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X}
[2020-08-18T22:47:14,995][INFO ][o.e.c.s.ClusterApplierService] [e60b1892bf32ee58bd969a3489c8d902] removed {{20eb7db1dd823cec115dc04c8fd525da}{FmulCOttTt6S_8Ymn1dpuQ}{erKnMqWURhKJYVJMdVDv2Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{265a1c0cbc47542c74dd3b16402ff8fd}{wKZH_U0OT0KHEbyDXtHd6A}{1T35_yN0RjaoL5aU2ZIBwA}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{eba50cd36d10a05f3da30a612a2d9b4e}{Qyh9JgRESzGuw4WUu6MSMQ}{-y0yFcfFS2S5hI-i-PQmwg}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,{5d211c055f1cee31c120d23f95c958b2}{UiD7LgTyQcy3kzTAfKG7RQ}{O1QSMnBHT_y_bxQT_u9VdQ}{X.X.X.X}{X.X.X.X:9300}X.X.X.X,}, term: 20345, version: 2310376, reason: ApplyCommitRequest{term=20345, version=2310376, sourceNode={b443e65c6dcfdab8ca3383d9e6fb6267}{kB4q0qa0S8ukCyfYCKBUKA}{cJvK-jXFSl2e6wds8Egh8Q}{X.X.X.X}{X.X.X.X:9300}X.X.X.X}

On the faulty data nodes, we only have cluster-formation helper logs and nothing else. No cluster state seems to have been published to them, yet we also don't see any join failures or publication failures around that time.
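As an aside, the flap cadence can be measured directly from the master log. This is a minimal sketch of a hypothetical helper; the regex and timestamp format are assumptions based on the 7.1 log lines quoted above:

```python
import re
from datetime import datetime

# Match MasterService node-join / node-left lines and capture the timestamp.
EVENT = re.compile(
    r"\[?(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\].*"
    r"(?P<kind>node-join|node-left)"
)

def flap_events(lines):
    """Return (timestamp, event-kind) pairs from master log lines."""
    out = []
    for line in lines:
        m = EVENT.search(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S,%f")
            out.append((ts, m.group("kind")))
    return out

# Abbreviated versions of the two MasterService lines quoted above.
sample = [
    "2020-08-18T22:47:11,357][INFO ][o.e.c.s.MasterService] [...] node-join[...]",
    "[2020-08-18T22:47:14,974][INFO ][o.e.c.s.MasterService] [...] node-left[...]",
]
events = flap_events(sample)
gap = (events[1][0] - events[0][0]).total_seconds()
print(events[0][1], events[1][1], gap)  # join then left, ~3.6 s apart
```

Run over the full master log, this would confirm whether the join/left pairs really repeat on a fixed 10-second schedule.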

TransportNodesListGatewayStartedShards

This appears to be failing on this call to the node, i.e. on the network call, though it doesn't say whether it's a network issue like a timeout or an error on the remote node; I'd certainly hope you'd see something at that time in the bad node's log.

I hope someone more knowledgeable has insights from what you've sent, but more logs would be great, especially from the bad nodes.

Seems odd that the node will join, then the master tries to allocate shards (or at least get a shard list) and that fails, then 'disconnected'.

It's as if the nodes are not really up or ready, or are out of disk space, etc., but then restarting them just works.

I assume these are not on shared storage like a VMware SAN, or on a cloud with I/O issues on those nodes causing them to stall, such that restarting/rebooting 'fixed' them.

It does, but it's neither of those; it's a straightforward disconnection:

Caused by: org.elasticsearch.transport.NodeNotConnectedException: [5d211c055f1cee31c120d23f95c958b2][X.X.X.X:9300] Node not connected

Network disconnection (at the TCP level) is also the reason the nodes left the cluster:

node-left[{265a1c0cbc47542c74dd3b16402ff8fd}{...} disconnected, {5d211c055f1cee31c120d23f95c958b2}{...} disconnected, {20eb7db1dd823cec115dc04c8fd525da}{...} disconnected, {eba50cd36d10a05f3da30a612a2d9b4e}{...} disconnected]...

Elasticsearch never itself drops fully-established incoming connections as reported here, so this very definitely points to infrastructural problems.

Okay, though I'm a bit confused (not unusual).

First, does 'disconnected' / 'node left' mean the connection was closed by the other side, or can it also time out waiting for data on the connection? I had assumed it was timing out and the master was closing the connection, creating a 'departure'.

Second, if it was actively closed, what sort of infrastructure issue could cause that, assuming no firewall or IDS is involved? I'd expect a timeout due to an ACL or dropped packets; it seems odd for infrastructure to send a FIN/RST to close the connection.

But if it's a timeout, how do we know it's infrastructure rather than a slow node, a long GC, bad I/O, etc. on that node?

@DavidTurner It is really strange that the join request goes through, meaning that the join validation must have passed, which means the connections were established. Validate-join is followed by the node-join cluster state update, which in turn executes the reroute, where we see that the connections have been dropped. The nodes were not serving any traffic, and the CPU/JVM/network metrics are pretty normal. Another strange thing is that this cycle repeats itself continuously.

Is it possible that the nodes join an existing leader without performing the validate-join request?

No, there is no timeout in Elasticsearch that would result in one of these connections being closed. They're supposed to last forever and are independent of any request-level timeouts.

Speculating on this topic is rarely fruitful I'm afraid. There are some pretty weird network configs out there, as well as some very subtle failure modes in routers and other middleware.

That does not explain disconnected, no. Elasticsearch does not itself drop fully-established incoming connections, irrespective of whether the incoming connection is from a joined node or not.
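For context on "supposed to last forever": with no application-level timeout, detecting a dead peer on an idle, long-lived connection falls to OS-level TCP keepalives. A sketch of the relevant socket options; that Elasticsearch 7.x enables keepalive by default via the `network.tcp.keep_alive` setting is my understanding, and the Linux-specific constants below are platform assumptions:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable OS-level keepalive probes on the connection.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# On Linux, the per-socket probe timings can be tightened
# (the kernel defaults wait ~2 hours of idle before probing):
if hasattr(socket, "TCP_KEEPIDLE"):
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle secs before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before drop

enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(enabled)  # non-zero once keepalive is on
```

The point is that only the OS (or something in the network path), not Elasticsearch itself, decides when an idle connection dies.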


Thanks for the reply!
It is rare for a network failure to be so synchronised that it hits the same 4 nodes, across different AZs, every few seconds for a period of an hour, while the master's connections to all the other nodes are unaffected. I understand there is a possibility that the network between these nodes and the new master had issues it didn't have with the old master.

The thing I cannot bring myself to believe is that the connection is established as part of validate-join every single time, without a single failure, then drops by the time the reroute path is reached, again every single time, and the cycle repeats on the same 10-second schedule. If this were a master issue, the other 47 nodes would have failed too; if it were a data node issue, the older master should have failed too. Note that the nodes are in different AZs.

You are right that speculation won't give me much here, but I'm finding this really hard to believe.

Still interesting, and I hope you can find the underlying reason - I hate not knowing why something happens. A couple of thoughts:

Does it still happen, or is it repeatable? If not, it may remain a mystery forever.

If it's repeatable, maybe use Wireshark to see whether the node really is closing the connection, though I'm not sure how helpful that would be. You may be able to see the data too (I think it's not encrypted by default, but it might be compressed; I've never looked).

Is this on AWS or a similar cloud, since you mention AZs? If so, it's hard to imagine any cloud networking component closing TCP connections, especially on internode connections with no load balancers etc. in between.

Are you sure the nodes were not crashing (OOM, etc.) and being restarted by supervisord? That's all I can think of that could cause a remote close of a TCP connection, and it would have them going up and down every 10 seconds or so. Exhaustion issues (file handles, iptables conntrack slots, etc.) affect TCP establishment and don't send a FIN/RST, but the kernel will if the process dies.
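A minimal sketch of that last point: if a process dies without closing its sockets, the kernel closes them on exit, and the peer observes an immediate remote close rather than a timeout (Unix-only, uses fork; illustrative only):

```python
import multiprocessing
import socket
import time

def node_process(conn):
    # Simulates a node process that holds one end of an established
    # connection and then dies without calling conn.close(); the kernel
    # closes the fd on exit and sends FIN to the peer.
    time.sleep(0.2)

ctx = multiprocessing.get_context("fork")  # fork so the child inherits the fd
master_end, node_end = socket.socketpair()
p = ctx.Process(target=node_process, args=(node_end,))
p.start()
node_end.close()   # drop our copy so the child holds the only reference
p.join()           # child has exited; kernel has closed its sockets
data = master_end.recv(1024)
print(repr(data))  # b'' -> the peer sees a remote close, not a timeout
```

So from the master's side, a crashing-and-restarting process looks exactly like the 'disconnected' node-left reason in the logs above.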

Let us know if you find anything :slight_smile: