Master election takes too long

I am using version 5.4.1.
I see this in the logs:

[2019-04-16T21:42:40,996][INFO ][o.e.d.z.ZenDiscovery     ] [node-2] master_left [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], reason [shut_down]

After 3 seconds:

[2019-04-16T21:42:44,000][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] failed to connect to master [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [node-1][172.20.67.236:9300] connect_timeout[1s]

and after another 3 seconds:

[2019-04-16T21:42:47,024][INFO ][o.e.c.s.ClusterService   ] [node-2] detected_master {node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}, reason: zen-disco-receive(from master [master {node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300} committed version [27]])
[2019-04-16T21:42:47,025][WARN ][o.e.c.NodeConnectionsService] [node-2] failed to connect to node {node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300} (tried [1] times)

My configuration is:

discovery.zen.commit_timeout: 2s
discovery.zen.publish_timeout: 2s
discovery.zen.fd.ping_timeout: 1s
transport.tcp.connect_timeout: 1s

What can I change in order to reduce this 6-7 second failover time?

These timeouts are dangerously short. I would expect this cluster to become unstable under any kind of realistic load.
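
For comparison (if I remember the defaults correctly), the out-of-the-box values of these same settings are far more conservative:

discovery.zen.commit_timeout: 30s      # default
discovery.zen.publish_timeout: 30s     # default
discovery.zen.fd.ping_timeout: 30s     # default
transport.tcp.connect_timeout: 30s     # default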

7.0.0 has a new, faster election mechanism, and in 7.1.0 we also expect to avoid synchronous connections to failed nodes thanks to #39629. I recommend upgrading.

Thank you for the quick reply.
Let's say I don't use any of these configurations.
How long should I expect ES to take to fail over and elect a new master in 5.4.1?

It depends on exactly how the master fails. If you simply shut it down then, with default settings, I'd expect a new master to be elected within a little over 4 seconds:

  • 1 second to discover the master has disconnected
  • 3 seconds to elect the new master
  • a few network round-trips to complete the process and re-establish the cluster
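
To sketch where those numbers come from (this is from memory, so treat it as approximate), they map onto these default settings:

discovery.zen.fd.ping_interval: 1s   # default: how often a node checks that the master it follows is still reachable
discovery.zen.ping_timeout: 3s       # default: how long an election's ping round waits before choosing a master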

In my case the master was shut down cleanly, and it took 6-7 seconds with the configuration shown above.
Does this make sense? I need to configure a logical timeout in my application. If 5 seconds is safe (the 4 you mentioned plus 1 second of buffer), I will use it.

I find it a little surprising that it took 7 seconds if the master were cleanly shut down. Can you share a more comprehensive set of logs from all of the nodes?

Sure.

node-1 was the master and was stopped cleanly:

[2019-04-16T21:38:58,779][INFO ][o.e.c.s.ClusterService   ] [node-1] added {{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300},}, reason: zen-disco-node-join[{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}]
[2019-04-16T21:39:00,444][INFO ][o.e.c.r.a.AllocationService] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events_1555318236265][0]] ...]).
[2019-04-16T21:42:40,979][INFO ][o.e.n.Node               ] [node-1] stopping ...
[2019-04-16T21:42:41,154][INFO ][o.e.n.Node               ] [node-1] stopped
[2019-04-16T21:42:41,154][INFO ][o.e.n.Node               ] [node-1] closing ...
[2019-04-16T21:42:41,160][INFO ][o.e.n.Node               ] [node-1] closed

node-2

[2019-04-16T21:42:40,996][INFO ][o.e.d.z.ZenDiscovery     ] [node-2] master_left [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], reason [shut_down]
[2019-04-16T21:42:40,997][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] master left (reason = shut_down), current nodes: nodes: 
{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}, master
{node-2}{SiCuliEBS8WzO0mBISJNnA}{DzVFJK5FRj-5FRnBxEtjsg}{172.20.67.238}{172.20.67.238:9300}, local
{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}

[2019-04-16T21:42:44,000][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] failed to connect to master [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [node-1][172.20.67.236:9300] connect_timeout[1s]

It took node-2 about 3 seconds from the time the master left until it failed to connect... why?

[2019-04-16T21:42:47,024][INFO ][o.e.c.s.ClusterService   ] [node-2] detected_master {node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}, reason: zen-disco-receive(from master [master {node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300} committed version [27]])
[2019-04-16T21:42:47,025][WARN ][o.e.c.NodeConnectionsService] [node-2] failed to connect to node {node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300} (tried [1] times)

It took 3 more seconds to detect the new master, and node-2 was still trying to connect to the old one. Why?

node-3

[2019-04-16T21:42:40,994][INFO ][o.e.d.z.ZenDiscovery     ] [node-3] master_left [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], reason [shut_down]
[2019-04-16T21:42:40,997][WARN ][o.e.d.z.ZenDiscovery     ] [node-3] master left (reason = shut_down), current nodes: nodes: 
{node-2}{SiCuliEBS8WzO0mBISJNnA}{DzVFJK5FRj-5FRnBxEtjsg}{172.20.67.238}{172.20.67.238:9300}
{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}, master
{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}, local

[2019-04-16T21:42:41,000][WARN ][o.e.c.NodeConnectionsService] [node-3] failed to connect to node {node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300} (tried [1] times)

[2019-04-16T21:42:47,015][INFO ][o.e.c.s.ClusterService   ] [node-3] new_master {node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}, reason: zen-disco-elected-as-master ([1] nodes joined)[{node-2}{SiCuliEBS8WzO0mBISJNnA}{DzVFJK5FRj-5FRnBxEtjsg}{172.20.67.238}{172.20.67.238:9300}]

Trying to perform actions via PreBuiltTransportClient from the Java application, we got these errors while retrying 3 times over a period of about 4 seconds:

WARN 2019-04-16 21:42:43,106 [EventPublisher~i31c] : ElasticManagementClient(ElasticManagementClient.performAction:242) - Failed to perform action, retries left : 2, got exception
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];

WARN 2019-04-16 21:42:45,111 [EventPublisher~i31c] : ElasticManagementClient(ElasticManagementClient.performAction:242) - Failed to perform action, retries left : 1, got exception
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];

WARN 2019-04-16 21:42:47,113 [EventPublisher~i31c] : ElasticManagementClient(ElasticManagementClient.performAction:242) - Failed to perform action, retries left : 0, got exception
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];

This is a little odd as it indicates that node-2 still thinks that node-1 is the master. Could you share the whole stack trace from this exception, including any Caused by inner exceptions?

Also, can you tell us a bit more about your environment? In particular, are you running in Docker?

I agree, it is odd...
Here is the stack trace:

[2019-04-16T21:42:44,000][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] failed to connect to master [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [node-1][172.20.67.236:9300] connect_timeout[1s]
    at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:360) ~[?:?]
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:534) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:473) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:315) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:302) ~[elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:468) [elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:420) [elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.discovery.zen.ZenDiscovery.access$4100(ZenDiscovery.java:83) [elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1197) [elasticsearch-5.4.1.jar:5.4.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 172.20.67.236/172.20.67.236:9300
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:352) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[?:?]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
    ... 1 more

I am not using Docker and the nodes are not virtual.
All nodes are configured to be data nodes and master-eligible.
Is any more information needed?
Again, thank you for your help.

Ok, I think I can reproduce this by shutting the master down very soon after another node has joined the cluster. Is that what's happening here?

When a node starts up it tries to discover the rest of the cluster, and it keeps hold of the information it discovers for a short while (6 seconds) afterwards. If it needs to perform another election it might use stale information and end up electing a master that's no longer there.

I think the gap between that join and the shutdown of the other node was about 4 minutes.

Adding the logs:

node-2

[2019-04-16T21:38:58,786][INFO ][o.e.c.s.ClusterService   ] [node-2] added {{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300},}, reason: zen-disco-receive(from master [master {node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300} committed version [24]])
[2019-04-16T21:42:40,996][INFO ][o.e.d.z.ZenDiscovery     ] [node-2] master_left [{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}], reason [shut_down]
[2019-04-16T21:42:40,997][WARN ][o.e.d.z.ZenDiscovery     ] [node-2] master left (reason = shut_down), current nodes: nodes: 
{node-1}{Nuzkh0bnRw6LPnqkO-KUDg}{wxOLyucfTDqZUSfspSh5PQ}{172.20.67.236}{172.20.67.236:9300}, master
{node-2}{SiCuliEBS8WzO0mBISJNnA}{DzVFJK5FRj-5FRnBxEtjsg}{172.20.67.238}{172.20.67.238:9300}, local
{node-3}{3DbhJ-TaSoGvAmVBYOE3aw}{FL_uo65LS7yGmjxsUVPmXQ}{172.20.67.240}{172.20.67.240:9300}

node-3

[2019-04-17T10:17:35,138][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-3] publish_address {172.20.67.240:9200}, bound_addresses {172.20.67.240:9200}
[2019-04-17T10:17:35,140][INFO ][o.e.n.Node               ] [node-3] started
[2019-04-17T10:17:35,408][INFO ][o.e.g.GatewayService     ] [node-3] recovered [1] indices into cluster_state
[2019-04-17T10:17:35,676][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[events_1555318236265][0]] ...]).
[2019-04-17T10:17:37,549][INFO ][o.e.c.r.a.AllocationService] [node-3] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[events_1555318236265][0]] ...]).

It's quite hard to see clearly all that is going on from these small log excerpts, so I'm having to do a certain amount of speculation. I note, for instance, that node-3 joined the cluster here too, just before node-1 was shut down.

Can you reproduce what you're seeing with logger.org.elasticsearch.discovery: TRACE and share the complete logs from all of the nodes?
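
For example (assuming the standard config layout), adding this line to elasticsearch.yml on each node and restarting should do it; the same key can also be set dynamically through the cluster settings API:

logger.org.elasticsearch.discovery: TRACE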

I will try, it will take at least a day. Thank you

Reproduced, I think. This time node-3 (the master) was the node being shut down.
I can't paste the logs here directly because of a size limitation, so I'm adding links, if that's OK:

node-1

node-2

node-3

The node-2 link doesn't work, could you double-check it?

Trying again
node-2

Ok, I see. node-3 takes around 200ms to stop:

[2019-04-18T12:53:55,648][INFO ][o.e.n.Node               ] [node-3] stopping ...
[2019-04-18T12:53:55,823][INFO ][o.e.n.Node               ] [node-3] stopped

One of the first things it does is announce that it's shutting down, triggering another election:

[2019-04-18T12:53:55,659][INFO ][o.e.d.z.ZenDiscovery     ] [node-1] master_left [{node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}], reason [shut_down]
[2019-04-18T12:53:55,659][WARN ][o.e.d.z.ZenDiscovery     ] [node-1] master left (reason = shut_down), current nodes: nodes:
   {node-2}{nS8tF5PhTmau1Jc8mJhx_A}{Xw7nuJ5IS4uEgUulQEb3YQ}{172.20.71.33}{172.20.71.33:9300}
   {node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}, master
   {node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}, local

[2019-04-18T12:53:55,659][DEBUG][o.e.d.z.MasterFaultDetection] [node-1] [master] stopping fault detection against master [{node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}], reason [master left (reason = shut_down)]
[2019-04-18T12:53:55,659][TRACE][o.e.d.z.NodeJoinController] [node-1] starting an election context, will accumulate joins

However, node-1 reacts very quickly to the news, starts its discovery phase, and discovers node-3 in under 20ms:

[2019-04-18T12:53:55,659][TRACE][o.e.d.z.ZenDiscovery     ] [node-1] starting to ping
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] resolved host [172.20.71.43] to [172.20.71.43:9300]
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] resolved host [172.20.71.33] to [172.20.71.33:9300]
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] resolved host [172.20.71.55] to [172.20.71.55:9300]
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] sending to {node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] sending to {node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}
[2019-04-18T12:53:55,660][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] sending to {node-2}{nS8tF5PhTmau1Jc8mJhx_A}{Xw7nuJ5IS4uEgUulQEb3YQ}{172.20.71.33}{172.20.71.33:9300}
[2019-04-18T12:53:55,661][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] received response from {node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}: [ping_response{node [{node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}], id[24], master [null],cluster_state_version [230], cluster_name[elastic-infinibox]}, ping_response{node [{node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}], id[25], master [null],cluster_state_version [230], cluster_name[elastic-infinibox]}]
[2019-04-18T12:53:55,661][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] received response from {node-2}{nS8tF5PhTmau1Jc8mJhx_A}{Xw7nuJ5IS4uEgUulQEb3YQ}{172.20.71.33}{172.20.71.33:9300}: [ping_response{node [{node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}], id[24], master [null],cluster_state_version [230], cluster_name[elastic-infinibox]}, ping_response{node [{node-2}{nS8tF5PhTmau1Jc8mJhx_A}{Xw7nuJ5IS4uEgUulQEb3YQ}{172.20.71.33}{172.20.71.33:9300}], id[8], master [{node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}],cluster_state_version [230], cluster_name[elastic-infinibox]}]
[2019-04-18T12:53:55,661][TRACE][o.e.d.z.UnicastZenPing   ] [node-1] [3] received response from {node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}: [ping_response{node [{node-1}{qtQ9zxHGS0yu8VyS_GDCuw}{a3ufzEfyQAubZvHWX9GXZg}{172.20.71.43}{172.20.71.43:9300}], id[24], master [null],cluster_state_version [230], cluster_name[elastic-infinibox]}, ping_response{node [{node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}], id[27], master [{node-3}{8v1o1noDS5e_e7TAgI9K7A}{WCrG1STSQMWu6MCMRiHIqQ}{172.20.71.55}{172.20.71.55:9300}],cluster_state_version [230], cluster_name[elastic-infinibox]}]

I think we could call this a bug: node-3 arguably shouldn't respond to pings after it's announced that it's shutting down. The only maintained version with this bug is now 6.7 (it doesn't occur in 7.0 or later) and I don't think it's critical enough to warrant a fix there.

This means that what I said earlier isn't the case - it now looks like it can take ~7 seconds to elect a new master because sometimes there will be two elections, each taking 3 seconds. TIL.

Thank you very much for the help. We will increase our application's timeouts to give master election a longer grace period. Would 15 seconds be enough, in your opinion?

Yes, that should be ample time to discover a new master after the old one has been shut down.

Thank you!