Problem with master re-election under ZenDiscovery (5.2.2)

I am testing recovery. I have a 5.2.2 cluster with 3 master-eligible nodes. When I kill the elected master node (in this case master-0), the entire cluster shuts down (except for Kibana). The gist of the problem seems to be:

  1. elected master killed
  2. other two master-eligible nodes learn that the elected master has left the cluster, but then keep trying to connect to it anyway, eventually throwing an exception and quitting
  3. all the other elasticsearch nodes give up

Looking at the logs below from the master-1 node, it correctly learns that master-0 has left the cluster, but then continues to list it as a current node, and then tries and fails to reconnect to it:

[2017-03-14T22:45:10,208][INFO ][o.e.d.z.ZenDiscovery ] [master-1-node] master_left [{master-0-node}{SzqZbCT_SfGjukio5cYrBg}{v58Bm3MKR4-Ucf1cEmBkmw}{10.0.1.79}{10.0.1.79:9300}], reason [transport disconnected]
[2017-03-14T22:45:10,210][WARN ][o.e.d.z.ZenDiscovery ] [master-1-node] master left (reason = transport disconnected), current nodes: nodes:
{master-1-node}{DL_sqiHoRg2bLCiD9tWh-w}{Ty-AlN5cRTWEBObmCJ9Z3g}{10.0.1.87}{10.0.1.87:9300}, local
{ingest-0-node}{tsGRRLHWSG6DOZ-b9A-wRg}{XXhtyKeLS9W4x6HmpNxitw}{10.0.2.99}{10.0.2.99:1028}
{data-0-node}{szxoWgObSaS8gE4BaSn5nw}{vC1s4MiXQPa9d2CEsyrLCg}{10.0.1.227}{10.0.1.227:1026}
{coordinator-0-node}{OxleJ2AcRlib9I18Vm1mOw}{DmUC-IUCTDW8GULDHYxbfQ}{10.0.1.227}{10.0.1.227:1028}
{master-0-node}{SzqZbCT_SfGjukio5cYrBg}{v58Bm3MKR4-Ucf1cEmBkmw}{10.0.1.79}{10.0.1.79:9300}, master
{master-2-node}{wr9bqzZ4SeeJflwtE88hSA}{7rCvWpSLSuOdIk6nohbGuw}{10.0.1.190}{10.0.1.190:9300}
{data-1-node}{WVL4cVOlTFesySvUv2gR0g}{jG105Vj9Rl6J5C-qFE19dw}{10.0.2.99}{10.0.2.99:1026}

[2017-03-14T22:45:10,226][WARN ][o.e.c.NodeConnectionsService] [master-1-node] failed to connect to node {master-0-node}{SzqZbCT_SfGjukio5cYrBg}{v58Bm3MKR4-Ucf1cEmBkmw}{10.0.1.79}{10.0.1.79:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [master-0-node][10.0.1.79:9300] connect_timeout[30s]
at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:370) ~[?:?]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:495) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:460) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:305) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:121) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:87) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:775) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_112]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_112]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_112]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.0.1.79/10.0.1.79:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:346) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:630) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
... 1 more
[2017-03-14T22:45:10,349][INFO ][o.e.n.Node ] [master-1-node] stopping ...

Is this a bug or am I just doing something really wrong somewhere else?

What does your config look like?

There are 3 masters, 2 data, 1 ingest, and 1 coordinator node in the cluster. The general elasticsearch.yml template looks like this:

discovery.zen.ping.unicast.hosts:
  - "master-0-node.{{FRAMEWORK_NAME}}.mesos:{{MASTER_NODE_TRANSPORT_PORT}}"
  - "master-1-node.{{FRAMEWORK_NAME}}.mesos:{{MASTER_NODE_TRANSPORT_PORT}}"
  - "master-2-node.{{FRAMEWORK_NAME}}.mesos:{{MASTER_NODE_TRANSPORT_PORT}}"
discovery.zen.minimum_master_nodes: 2
network.host:
  - "_site_"
  - "_local_"
cluster.name: {{FRAMEWORK_NAME}}
node.name: {{TASK_NAME}}
path.data: "container-path/data"
path.logs: "container-path/logs"
http.port: {{PORT_HTTP}}
transport.tcp.port: {{PORT_TRANSPORT}}
bootstrap.memory_lock: false
node.master: {{MASTER_ENABLED}}
node.data: {{DATA_ENABLED}}
node.ingest: {{INGEST_ENABLED}}
metrics.statsd.host: {{STATSD_UDP_HOST}}
metrics.statsd.port: {{STATSD_UDP_PORT}}

It then gets interpolated for each node based on those env vars.

The thing that has me worried is how ZenDiscovery is listing master-0 as the master node even though it knows it's no longer the master and has disconnected. Is there anything configuration-related that could cause this to happen?

I fired up a local 3 node cluster of 5.2.2 to compare the re-election behavior.

When I kill -9 the elected master, I get the same master_left message that still lists the old master under current nodes, so that part is a red herring. But the exception stack trace is different.

Compare local cluster with working re-election:

[2017-03-15T09:38:23,254][WARN ][o.e.c.NodeConnectionsService] [XPM5620] failed to connect to node {go_IoI6}{go_IoI6_TfyCZ-L_keqNVA}{g9nPt7mDQL-kYetiOMwfqA}{127.0.0.1}{127.0.0.1:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [go_IoI6][127.0.0.1:9300] general node connection failure
Caused by: java.lang.IllegalStateException: handshake failed
Caused by: org.elasticsearch.transport.TransportException: connection reset
[2017-03-15T09:38:26,274][INFO ][o.e.c.s.ClusterService ] [XPM5620] detected_master {Q0FdVdM}{Q0FdVdMVS2S_1yxXDA127g}{Abt6D7rlRYSMZ4yS6P5BQQ}{127.0.0.1}{127.0.0.1:9302}, reason: zen-disco-receive(from master [master {Q0FdVdM}{Q0FdVdMVS2S_1yxXDA127g}{Abt6D7rlRYSMZ4yS6P5BQQ}{127.0.0.1}{127.0.0.1:9302} committed version [8]])
[2017-03-15T09:38:26,340][INFO ][o.e.c.s.ClusterService ] [XPM5620] removed {{go_IoI6}{go_IoI6_TfyCZ-L_keqNVA}{g9nPt7mDQL-kYetiOMwfqA}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [master {Q0FdVdM}{Q0FdVdMVS2S_1yxXDA127g}{Abt6D7rlRYSMZ4yS6P5BQQ}{127.0.0.1}{127.0.0.1:9302} committed version [9]])

to distributed cluster with broken re-election:

[2017-03-14T22:45:10,226][WARN ][o.e.c.NodeConnectionsService] [master-1-node] failed to connect to node {master-0-node}{SzqZbCT_SfGjukio5cYrBg}{v58Bm3MKR4-Ucf1cEmBkmw}{10.0.1.79}{10.0.1.79:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [master-0-node][10.0.1.79:9300] connect_timeout[30s]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 10.0.1.79/10.0.1.79:9300
[2017-03-14T22:45:10,349][INFO ][o.e.n.Node ] [master-1-node] stopping ...

They both throw an org.elasticsearch.transport.ConnectTransportException, but for different reasons. It seems like when Netty complains about "Connection refused", it's fatal.

I tried using netty3 instead of netty4 on a whim, but no luck. Strictly speaking I set the type to security3, since I'm using X-Pack:

transport.type: security3
http.type: security3

I find it peculiar that the node tries to reconnect to master-0 immediately after processing the fact that it is gone:

[2017-03-14T22:45:10,208][INFO ][o.e.d.z.ZenDiscovery ] [master-1-node] master_left [{master-0-node} ...
[2017-03-14T22:45:10,210][WARN ][o.e.d.z.ZenDiscovery ] [master-1-node] master left (reason = transport ...
[2017-03-14T22:45:10,226][WARN ][o.e.c.NodeConnectionsService] [master-1-node] failed to connect to node {master-0-node}{SzqZbCT_SfGjukio5cYrBg}{v58Bm3MKR4-Ucf1cEmBkmw}{10.0.1.79}{10.0.1.79:9300} (tried [1] ...

Note how the timestamps are all within the same second. A default master election takes 3 seconds. Do you have any other discovery-related settings (for example: discovery.zen.ping_timeout)?

Nope, all of my settings are included above. Once the elected master is killed, the other nodes stop themselves within 2ms. Not 1500x later at the 3s an election would take, and definitely not at the 30s the connect_timeout in the log message implies.

Note that you can reproduce the try-to-reconnect-to-the-dead-master behavior on your laptop easily. Launch 3 elasticsearch processes (set node.max_local_storage_nodes: 3 in elasticsearch.yml) and then kill -9 the elected master. You'll see the other 2 nodes log the same peculiar message.
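Roughly, the local repro looks like this (a sketch assuming a plain 5.2.2 tarball install in the current directory; add -u elastic:changeme to the curl if X-Pack security is installed, and match the elected node name to a PID however you prefer):

# allow three nodes to share the same data path
echo "node.max_local_storage_nodes: 3" >> config/elasticsearch.yml

# start three nodes from the same install; they bind 9200-9202 / 9300-9302 automatically
./bin/elasticsearch -d -p node0.pid
./bin/elasticsearch -d -p node1.pid
./bin/elasticsearch -d -p node2.pid

# note which node name is currently elected, then kill that process hard
curl -s 'localhost:9200/_cat/master?v'
kill -9 <pid-of-the-elected-master>

Within a couple of seconds the two survivors log the same master_left / failed-to-connect pair shown above.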

These log messages are weird, but I'm not sure they're related to my problem. I'm just trying to find out why Netty is throwing an exception that is causing elasticsearch to exit when the master node is killed.

I tried to reproduce and things work as expected - I kill the master and after 3s a new one is elected:

[2017-03-15T22:49:14,810][INFO ][o.e.d.z.ZenDiscovery     ] [i4uS5Uw] master_left [{EYz9Qqb}{EYz9QqbfReyHtTYWNekjFw}{oh19CSG4SZSWNz2bxCvCNQ}{127.0.0.1}{127.0.0.1:9300}], reason [transport disconnected]
[2017-03-15T22:49:14,812][WARN ][o.e.d.z.ZenDiscovery     ] [i4uS5Uw] master left (reason = transport disconnected), current nodes: nodes:
[2017-03-15T22:49:14,826][WARN ][o.e.c.NodeConnectionsService] [i4uS5Uw] failed to connect to node {EYz9Qqb}{EYz9QqbfReyHtTYWNekjFw}{oh19CSG4SZSWNz2bxCvCNQ}{127.0.0.1}{127.0.0.1:9300} (tried [1] times)
[2017-03-15T22:49:17,838][INFO ][o.e.c.s.ClusterService   ] [i4uS5Uw] new_master {i4uS5Uw}{i4uS5UwUQvyZnrPyMiJrHg}{Gr5m-tNyRye5sxIt0NjvlA}{127.0.0.1}{127.0.0.1:9301}, reason: zen-disco-elected-as-master ([1] nodes joined)[{k_eeWpg}{k_eeWpgeRKC86SZNFMQlTg}{3yhpYtfsS1GkfNL9_25e-Q}{127.0.0.1}{127.0.0.1:9302}]

As you can see, I too get a warning about trying to connect to the old master. That has to do with the fact that only an active master can add or remove nodes from the cluster state. Since the master suddenly died, the other nodes notice that and start an election. As part of it they change the local cluster state to indicate that there isn't any active master (but they don't remove the node, as they don't have a mandate for that). Since the cluster state still contains the node, it triggers an attempt to reconnect, which fails. This is harmless.

I am curious to know more about why no new master is elected, if you have more information.

Thank you Boaz for looking into it! I created this github issue and put all the information I have in it: https://github.com/elastic/elasticsearch/issues/23615

Looks like this behavior showed up in 5.2.1 or 5.2.2.

I enabled trace on Zen discovery like this:

curl -XPUT -u elastic:changeme 'localhost:1025/_cluster/settings' -d '{"transient": {"logger.discovery.zen":"TRACE"}}'
{"acknowledged":true,"persistent":{},"transient":{"logger":{"discovery":{"zen":"TRACE"}}}}

Unfortunately the log output is the same when I reproduce the issue.

What other information can I provide to help diagnose this?

With 5.x you should use a full class path for the logging. That means setting logger.org.elasticsearch.discovery.zen. Post here when you have something.
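For example, the same call you ran earlier, just with the full class path:

curl -XPUT -u elastic:changeme 'localhost:1025/_cluster/settings' -d '{"transient": {"logger.org.elasticsearch.discovery.zen": "TRACE"}}'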

Cheers,
Boaz

In the case below, master-2 was elected. Here are trace logs from master-0. No additional logs for master-2 were generated after it received kill -9.

The bit that jumps out at me is here:

[2017-03-20T17:16:36,267][TRACE][o.e.d.z.ZenDiscovery     ] [master-0-node] not enough master nodes [[Candidate{node={master-0-node}{j4AxqF1PSlOPUYgNCHL1rA}{4JNSDZLlSn2cqzQOgJit0g}{10.0.3.84}{10.0.3.84:9300}, clusterStateVersion=37}]]
[2017-03-20T17:16:36,268][TRACE][o.e.d.z.ZenDiscovery     ] [master-0-node] thread is no longer in currentJoinThread. Stopping.

I wonder why master-0 thinks there are not enough masters. It's obviously eligible itself, it just heard back from master-1, and discovery.zen.minimum_master_nodes is 2.

The lines you quoted were generated after you had stopped the node. As you can see, the pinging was interrupted and thus its result was not used.

[2017-03-20T17:16:36,262][INFO ][o.e.n.Node               ] [master-0-node] stopping ...
[2017-03-20T17:16:36,267][TRACE][o.e.d.z.ZenDiscovery     ] [master-0-node] pingAndWait interrupted

Can you post the result of a few completed election cycles?

The logs are from the master-0 node, which stopped itself. I did not stop it. Can you get a sense from the logs why it would shut itself down?

The exception in there seems to suggest that the Connection refused it gets when trying to connect to the prior-elected but newly-deceased master-2 is not an outcome it is able to handle. Am I mis-interpreting that?

Do you want me to capture TRACE logs on the initial election when the cluster is created? That's the only successful election that takes place in the life of these clusters.

Can you get a sense from the logs why it would shut itself down?

The only time ES shuts itself down is when it hits an uncaught exception like an OOM or something similar. Those are clearly logged, so we know that didn't happen here. Something else caused ES to be shut down. Maybe the OS's OOM killer?

The exception in there seems to suggest that the Connection refused it gets when trying to connect to the prior-elected but newly-deceased master-2 is not an outcome it is able to handle. Am I mis-interpreting that?

It does try to connect, because you never know if it's a network hiccup or a true death. This is not a problem and is part of normal pinging. That's why it's logged under trace.

Do you want me to capture TRACE logs on the initial election when the cluster is created? That's the only successful election that takes place in the life of these clusters.

I want to see a full run of an unsuccessful election. Preferably a couple, because in rare cases an election may fail and another one is started, which is OK.

No, that's a SIGKILL; it wouldn't be a clean stop like we are seeing here.

Good one. Thanks.

I found out what's causing those non-elected-master nodes to stop: they are failing an external health check when the master is lost. There's a health check that runs locally and pings / every 1s, like this:

curl  -I -s -f -u kibana:changeme localhost:1026

It's set up pretty aggressively with no tolerance for retries, so as soon as the health check fails once, the elasticsearch node is killed off and restarted. Based on these docs, I assumed the remaining nodes would continue to return success at / even when the master is lost and during the re-election process. This is not the case. The nodes return HTTP >= 400 for up to 3 seconds.

Looking at this issue, it seems master-loss behavior has been a source of debate for 2+ years now:

Another aspect to consider here - on master loss (since 1.4), all data nodes will have a master block for 3s. If you hit / at that moment, no node will be happy and that's I believe this ticket is about. - Boaz

I got turned around on so many levels chasing this down. The connection-refused stack trace seemed like the obvious problem, but it was totally unrelated and not even a problem. The health check requests show up in the logs when they return a 200, but during the block they aren't logged at all, so I thought the nodes were failing on their own. Finally, I didn't understand that during re-election you lose all access to your cluster for 3s regardless of discovery.zen.no_master_block. This 3s break seems to be how Elastic Cloud operates, too.
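If you want to watch the window yourself, a quick-and-dirty loop like the one below (a sketch against the same port and credentials my health check uses) prints one status code per second; kill the master and you'll see it flip from 200 to a 5xx for a few seconds:

while true; do
  # print only the HTTP status code of each probe
  curl -s -o /dev/null -w '%{http_code}\n' -u kibana:changeme localhost:1026
  sleep 1
done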

The workaround for me is to increase the health check retries to > 3.

Thanks Boaz et al for helping me work this out!

Glad you found it.

What do you mean exactly by this? Is it your logs?

The workaround for me is to increase the health check retries to > 3.

I'm not sure about this one. In almost all cases where a node is having problems, the best response is to let the node recover itself. Restarts are harsh. I would personally restart a node automatically only after, say, 10 minutes of consecutive bad behavior.

By "logs" I mean the same stdout file that contained the Elasticsearch stack trace. It also contains the output of the health check.

That's good advice on the health check. My intention here was/is to guard against an elasticsearch process that for whatever reason is running but is not responsive. But returning HTTP >= 400 is technically still responding, so I should adjust my health check's expectations.
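For what it's worth, the shape of the check I'm moving toward is roughly this (a sketch, not my actual scheduler config; the threshold and interval are made up): only declare the node dead after several consecutive failures, so a 3-second master block can't trigger a restart:

#!/usr/bin/env bash
# fail the health check only after N consecutive bad probes
max_fails=10
fails=0
while true; do
  if curl -I -s -f -u kibana:changeme localhost:1026 > /dev/null; then
    fails=0
  else
    fails=$((fails + 1))
    if [ "$fails" -ge "$max_fails" ]; then
      echo "unhealthy for $fails consecutive checks, giving up"
      exit 1
    fi
  fi
  sleep 1
done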

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.