Cluster instability if master leader is stopped


(Raman Gupta) #1

I have an ES 6.2.4 cluster. The cluster runs inside Kubernetes and contains 3 master and 3 data nodes with zen discovery. The cluster is idle. I am testing some failure scenarios and trying to get a handle on some odd behavior.

Minimum master nodes (discovery.zen.minimum_master_nodes) is set to 2.
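For context, 2 is the correct quorum here: zen discovery expects a strict majority of master-eligible nodes. A minimal sketch of the formula (the helper name is mine, not an Elasticsearch API):

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum for zen discovery: a strict majority of master-eligible nodes."""
    return master_eligible // 2 + 1

# With 3 master-eligible nodes, the quorum is 2, matching the
# discovery.zen.minimum_master_nodes: 2 setting described above.
print(minimum_master_nodes(3))  # → 2
```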

The cluster operates normally and handles node failures without fanfare, except when the elected master node is stopped. Then all hell breaks loose until the node comes back.

Here, es-master-2, the current elected master, is shut down normally:

[2018-06-16T15:52:28,847][INFO ][o.e.n.Node               ] [es-master-2] stopping ...
[2018-06-16T15:52:28,941][INFO ][o.e.x.w.WatcherService   ] [es-master-2] stopping watch service, reason [shutdown initiated]
[2018-06-16T15:52:29,207][INFO ][o.e.n.Node               ] [es-master-2] stopped
[2018-06-16T15:52:29,208][INFO ][o.e.n.Node               ] [es-master-2] closing ...
[2018-06-16T15:52:29,221][INFO ][o.e.n.Node               ] [es-master-2] closed

and a few seconds later, es-master-0 seems to think es-master-1 is now the leader:

[2018-06-16T15:52:32,004][INFO ][o.e.c.s.ClusterApplierService] [es-master-0] detected_master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300}, reason: apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66]])

and es-master-1 seems to agree:

[2018-06-16T15:53:02,001][INFO ][o.e.c.s.ClusterApplierService] [es-master-1] new_master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300}, reason: apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66] source [zen-disco-elected-as-master ([1] nodes joined)[{es-master-0}{kJBd1zqaQUGvUgOKEpK3BA}{kvzS_rXuSLip-iC-WMnJlQ}{10.2.5.134}{10.2.5.134:9300}]]])

However, a few seconds after that, both remaining master nodes start logging exceptions about being unable to connect to the now-missing es-master-2 node:

[2018-06-16T15:53:02,011][WARN ][o.e.c.NodeConnectionsService] [es-master-0] failed to connect to node {es-master-2}{bZzPyOqeT1SmVMYkRzoEng}{95EaEY6nR6CWg6nMZ4TqvA}{10.2.4.112}{10.2.4.112:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [es-master-2][10.2.4.112:9300] connect_timeout[30s]

and

[2018-06-16T15:53:05,570][WARN ][o.e.c.NodeConnectionsService] [es-master-1] failed to connect to node {es-master-2}{bZzPyOqeT1SmVMYkRzoEng}{95EaEY6nR6CWg6nMZ4TqvA}{10.2.4.112}{10.2.4.112:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [es-master-2][10.2.4.112:9300] connect_timeout[30s]

(which is probably ok, as they are just trying to reconnect to a node that is still down...)

What is not ok is that they then also start getting cluster state timeout failures, e.g. on es-master-0:

[2018-06-16T15:53:02,104][WARN ][o.e.c.s.ClusterApplierService] [es-master-0] cluster state applier task [apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66]])] took [30s] above the warn threshold of 30s
[2018-06-16T15:53:31,949][INFO ][o.e.d.z.ZenDiscovery     ] [es-master-0] failed to send join request to master [{es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2018-06-16T15:53:35,761][INFO ][o.e.c.s.ClusterApplierService] [es-master-0] removed {{es-master-2}{bZzPyOqeT1SmVMYkRzoEng}{95EaEY6nR6CWg6nMZ4TqvA}{10.2.4.112}{10.2.4.112:9300},{es-data-2}{EI8p9YoYSvm1kK5rueMY1A}{fUqNGOBkQLC5HAk4vYix5g}{10.2.4.44}{10.2.4.44:9300},{es-data-1}{W0j6dEurSN6uh7eKog-DtA}{UmdxdduOTAiybh-Uf1ih7Q}{10.2.5.85}{10.2.5.85:9300},{es-data-0}{O47gtbITQBmKnQ1UBw_IuA}{nwUyzpouR0KoozNiRoIbXg}{10.2.4.101}{10.2.4.101:9300},}, reason: apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [67]])

as well as failures on es-master-1:

[2018-06-16T15:53:01,999][WARN ][o.e.d.z.PublishClusterStateAction] [es-master-1] timed out waiting for all nodes to process published state [66] (timeout [30s], pending nodes: [{es-data-1}{W0j6dEurSN6uh7eKog-DtA}{UmdxdduOTAiybh-Uf1ih7Q}{10.2.5.85}{10.2.5.85:9300}, {es-data-2}{EI8p9YoYSvm1kK5rueMY1A}{fUqNGOBkQLC5HAk4vYix5g}{10.2.4.44}{10.2.4.44:9300}, {es-data-0}{O47gtbITQBmKnQ1UBw_IuA}{nwUyzpouR0KoozNiRoIbXg}{10.2.4.101}{10.2.4.101:9300}, {es-master-0}{kJBd1zqaQUGvUgOKEpK3BA}{kvzS_rXuSLip-iC-WMnJlQ}{10.2.5.134}{10.2.5.134:9300}])
[2018-06-16T15:53:02,001][INFO ][o.e.c.s.ClusterApplierService] [es-master-1] new_master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300}, reason: apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66] source [zen-disco-elected-as-master ([1] nodes joined)[{es-master-0}{kJBd1zqaQUGvUgOKEpK3BA}{kvzS_rXuSLip-iC-WMnJlQ}{10.2.5.134}{10.2.5.134:9300}]]])
[2018-06-16T15:53:35,684][WARN ][o.e.c.s.ClusterApplierService] [es-master-1] cluster state applier task [apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66] source [zen-disco-elected-as-master ([1] nodes joined)[{es-master-0}{kJBd1zqaQUGvUgOKEpK3BA}{kvzS_rXuSLip-iC-WMnJlQ}{10.2.5.134}{10.2.5.134:9300}]]])] took [33.6s] above the warn threshold of 30s
[2018-06-16T15:53:35,690][WARN ][o.e.c.s.MasterService    ] [es-master-1] cluster state update task [zen-disco-elected-as-master ([1] nodes joined)[{es-master-0}{kJBd1zqaQUGvUgOKEpK3BA}{kvzS_rXuSLip-iC-WMnJlQ}{10.2.5.134}{10.2.5.134:9300}]] took [1m] above the warn threshold of 30s

and now the entire cluster becomes unstable -- the data nodes start dropping out as well. The cluster remains red/unstable until es-master-2 is back up and running, at which point things restabilize.

The full logs from the three masters are here: https://drive.google.com/open?id=16R3eSgfuIDUqLqqkzuM_jHLxIiwTH4Zc

Discovery (discovery.zen.ping.unicast.hosts) currently points to a DNS entry that resolves to each of the masters, but I've also tried listing each master explicitly as a comma-delimited list, and the behavior is effectively the same:

# nslookup elasticsearch-discovery
[...]
Name:   elasticsearch-discovery.default.svc.cluster.local
Address: 10.2.4.223
Address: 10.2.4.112
Address: 10.2.5.134
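A sketch of the two elasticsearch.yml variants described above (the service name matches the nslookup output; in practice stable per-pod DNS names would usually replace the raw pod IPs in the second form):

```yaml
# Option 1: a single DNS name that resolves to all master pods
discovery.zen.ping.unicast.hosts: elasticsearch-discovery

# Option 2: each master listed explicitly, comma-delimited
# discovery.zen.ping.unicast.hosts: 10.2.4.223,10.2.4.112,10.2.5.134

discovery.zen.minimum_master_nodes: 2
```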

(Raman Gupta) #3

I updated the cluster to 6.3.0 and still have the same issue.


(David Turner) #4

I suspect something strange in your Docker networking config. On election, the new master (es-master-1) is disconnected from the data nodes:

[2018-06-16T15:53:35,702][WARN ][o.e.t.n.Netty4Transport  ] [es-master-1] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.2.5.85:37894}]
java.nio.channels.ClosedChannelException: null
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

Furthermore, its attempts to reconnect are actively refused:

[2018-06-16T15:54:08,078][WARN ][o.e.c.NodeConnectionsService] [es-master-1] failed to connect to node {es-data-1}{W0j6dEurSN6uh7eKog-DtA}{UmdxdduOTAiybh-Uf1ih7Q}{10.2.5.85}{10.2.5.85:9300} (tried [1] time
org.elasticsearch.transport.ConnectTransportException: [es-data-1][10.2.5.85:9300] connect_exception
        at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:154) [elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:107) [elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.2.5.85/10.2.5.85:9300
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
        ... 1 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) ~[?:?]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
        ... 1 more

(Raman Gupta) #5

I think the connection refused errors are happening as a result, not as a cause. The timestamp of the error you showed is 2018-06-16T15:53:35,702. But you can see that the errors about failures connecting to the newly elected master are before that:

[2018-06-16T15:53:02,104][WARN ][o.e.c.s.ClusterApplierService] [es-master-0] cluster state applier task [apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66]])] took [30s] above the warn threshold of 30s
[2018-06-16T15:53:31,949][INFO ][o.e.d.z.ZenDiscovery     ] [es-master-0] failed to send join request to master [{es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

Based on the timeline, I think the data nodes are going down because they, too, can no longer reach the master; their going down is then what produces the "connection refused" errors you see.
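The ordering argument can be checked mechanically by comparing the timestamps of the two log lines in question:

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S,%f"

# Timestamps taken from the log lines quoted above.
join_failure = datetime.strptime("2018-06-16T15:53:31,949", FMT)  # es-master-0 join failure
conn_refused = datetime.strptime("2018-06-16T15:53:35,702", FMT)  # first send-failure on es-master-1

# The join-request failure precedes the first dropped/refused
# connection by a few seconds.
print(conn_refused - join_failure)  # → 0:00:03.753000
```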


(David Turner) #6

These are not errors about failures connecting to the newly-elected master. They are timeout messages which I think are to be expected because es-master-2 has vanished.

Elasticsearch nodes do not "go down" or refuse connections because of a change of (or a lack of) an elected master.


(Raman Gupta) #7

These are not errors about failures connecting to the newly-elected master. They are timeout messages which I think are to be expected because es-master-2 has vanished.

Hmm, I don't think so. es-master-2 vanished, but the message below states that es-master-0 has timed out attempting to apply cluster state from es-master-1 -- and this is before any errors from the data nodes, and after es-master-1 has successfully been elected as the new master:

[2018-06-16T15:53:02,104][WARN ][o.e.c.s.ClusterApplierService] [es-master-0] cluster state applier task [apply cluster state (from master [master {es-master-1}{MJeNWUFgT7yNk-_8w1LZyQ}{uxnRZt7ARYGm7ASYNMMDwg}{10.2.4.223}{10.2.4.223:9300} committed version [66]])] took [30s] above the warn threshold of 30s

(David Turner) #8

Yes, that's right. The first cluster state update from es-master-1, establishing its leadership, will still include es-master-2, which causes es-master-0 to try to reconnect to es-master-2, and that takes 30 seconds to time out. Docker is special here: outside Docker, the connection attempt is normally refused immediately, and issue #29025 tracks work to improve this.

However, this is a transient issue: after the timeout, the next cluster state update should remove es-master-2 and the cluster should recover. It doesn't explain the other connections being dropped and refused, and that is the unexpected thing in your logs.
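The fast-refusal vs. slow-timeout distinction David describes can be illustrated with a small socket sketch (the helper is hypothetical, not part of Elasticsearch): connecting to a closed port on a reachable host fails fast with "connection refused", whereas connecting to an address whose packets are silently dropped (as can happen with a vanished pod IP in some container network setups) blocks until the connect timeout, which is what the 30-second stalls in the logs correspond to.

```python
import socket

def try_connect(host: str, port: int, timeout: float) -> str:
    """Classify how a TCP connect attempt fails (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        return "refused"   # fast failure: host is up, port closed (RST received)
    except socket.timeout:
        return "timeout"   # slow failure: packets silently dropped, no reply
    finally:
        s.close()

# A closed port on a reachable host is refused immediately rather than
# consuming the full connect timeout:
print(try_connect("127.0.0.1", 1, 0.5))
```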


(system) #9

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.