Cluster recovery and reachability take a long time when the master leaves

Hi,

we're test-running a 3-node Elasticsearch cluster, version 5.6.9.

There seems to be a problem when the master node becomes unreachable to the other two nodes (for example, when its network cable gets pulled). The remaining nodes take about a minute or longer to recover, and during this time the cluster does not respond to requests.

We're testing reachability as follows:

while true; do timeout 1 curl --connect-timeout 1 -XGET http://10.231.83.146:9202/_cluster/health; echo ""; date; sleep 1; done

When the master leaves, no curl requests are answered for about 1 minute. We ran these tests several times. It can take up to 2 minutes before the cluster becomes responsive again.

My questions to you:

  1. Is this normal/to be expected?
  2. If not, is there a way to fix it?

Thanks for any pointers in advance!

Here are the configurations:

Node 1:

cluster.name: testcluster
network.host: 10.231.83.238
http.enabled: true
http.port: 9200
transport.tcp.port: 9300
discovery.zen.ping.unicast.hosts: ["10.231.83.238:9300", "10.231.83.146:9301", "10.231.83.146:9302"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 4s
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/

Node 2:

cluster.name: testcluster
network.host: 10.231.83.146
http.enabled: true
http.port: 9201
transport.tcp.port: 9301
discovery.zen.ping.unicast.hosts: ["10.231.83.238:9300", "10.231.83.146:9301", "10.231.83.146:9302"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 4s
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/

Node 3:

cluster.name: testcluster
network.host: 10.231.83.146
http.enabled: true
http.port: 9202
transport.tcp.port: 9302
discovery.zen.ping.unicast.hosts: ["10.231.83.238:9300", "10.231.83.146:9301", "10.231.83.146:9302"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 4s
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/

Here is some example log output. Please note that Node 1 is the master before we pull the network cable. Because I'm not allowed to post more than 7000 characters, I've put the logs on pastebin:

Node 1: https://pastebin.com/TCa2Wuia
Node 2: https://pastebin.com/7sKv0Z0z
Node 3: https://pastebin.com/FzGvT7Ji

Looks like this issue is described here:

The current workaround seems to be to set transport.tcp.connect_timeout to 2s in elasticsearch.yml and to run sysctl -w net.ipv4.tcp_retries2=3, which reduces the downtime to about 10-20 seconds. Still not great, but better than nothing.
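Concretely, that works out to something like the following on every node (a rough sketch; the sysctl.d file name is just an example for persisting the kernel setting across reboots):

# elasticsearch.yml (on every node):
#   transport.tcp.connect_timeout: 2s

# apply the kernel setting immediately (as root):
sysctl -w net.ipv4.tcp_retries2=3

# and persist it across reboots:
echo 'net.ipv4.tcp_retries2 = 3' > /etc/sysctl.d/90-tcp-retries.conf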

Anything I'm missing?

This sounds similar to the discussion here:

Reduce net.ipv4.tcp_retries2 and the connection timeout and you should see improvements. Even 10-20 seconds sounds like a long time for the cluster to recover with those settings set appropriately, and I'd be interested to see logs from a recovery that did take that long.

Edit: it looks like you have reduced the ping timeout to 4s in Elasticsearch, which will help it detect the connection drop a little quicker but can harm your cluster stability since it will remove nodes from the cluster if they pause for a few seconds of GC. It's much better to detect the connection drop with net.ipv4.tcp_retries2 since this is independent of GC.


Thank you for the reply, David.

I've adapted the configuration according to your suggestions:

cluster.name: testcluster
network.host: 10.231.83.238
http.enabled: true
http.port: 9200
transport.tcp.port: 9300
transport.tcp.connect_timeout: 1s
discovery.zen.ping.unicast.hosts: ["10.231.83.238:9300", "10.231.83.146:9301", "10.231.83.146:9302"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 30s
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/

Also set net.ipv4.tcp_retries2=3 on all nodes.

Running the curl script against the cluster and unplugging the network cable at exactly 14:09:00 gives me the following results:

{"cluster_name":"testcluster","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":9,"active_shards":22,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
Mon Jan 28 14:08:59 CET 2019
{"cluster_name":"testcluster","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":9,"active_shards":22,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
Mon Jan 28 14:09:00 CET 2019
Mon Jan 28 14:09:02 CET 2019
Mon Jan 28 14:09:04 CET 2019
Mon Jan 28 14:09:06 CET 2019
Mon Jan 28 14:09:09 CET 2019
Mon Jan 28 14:09:11 CET 2019
Mon Jan 28 14:09:13 CET 2019
{"cluster_name":"testcluster","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":9,"active_shards":22,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":3,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":217,"active_shards_percent_as_number":100.0}
Mon Jan 28 14:09:14 CET 2019
{"cluster_name":"testcluster","status":"yellow","timed_out":false,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":9,"active_shards":14,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":8,"delayed_unassigned_shards":8,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":63.63636363636363}
Mon Jan 28 14:09:15 CET 2019

As you can see, the cluster wasn't reachable for 13 seconds. To be honest, I'd be happy to have it recover more quickly. If you have any suggestions, let me know.

Here are the logs:

Node 1 (the master at the time the network cable was unplugged; I don't think we'll need the full log here, right?):

[2019-01-28T14:07:52,674][INFO ][o.e.c.r.DelayedAllocationService] [yRj9xlW] scheduling reroute for delayed shards in [59.4s] (9 delayed shards)
[2019-01-28T14:08:40,627][INFO ][o.e.c.s.ClusterService   ] [yRj9xlW] added {{wAVZzky}{wAVZzkyrSfyBmuGRw0Bltw}{1itL4IuvRxS01hZwL-T76g}{10.231.83.146}{10.231.83.146:9302},}, reason: zen-disco-node-join
[2019-01-28T14:08:43,446][INFO ][o.e.c.r.a.AllocationService] [yRj9xlW] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[device][0]] ...]).

Node 2: https://pastebin.com/WXYTx4mK
Node 3: https://pastebin.com/8yUBgqgf

Node 2 was elected as the master after the network dropped, and looking at its logs I see this timeline:

[2019-01-28T14:09:08,512][INFO ][o.e.d.z.ZenDiscovery     ] [lJgJwYh] master_left [{yRj9xlW}{yRj9xlW8Q-WgQ0j6H4DbOQ}{SEn_C0KDTbSvoD580O9KUg}{10.231.83.238}{10.231.83.238:9300}], reason [transport disconnected]
[2019-01-28T14:09:11,802][INFO ][o.e.c.s.ClusterService   ] [lJgJwYh] new_master {lJgJwYh}{lJgJwYhmQA-fucaOBHD5Vg}{D5xXwnQrTHy2wgsN7qUFIg}{10.231.83.146}{10.231.83.146:9301}, reason: zen-disco-elected-as-master ([1] nodes joined)[, ]
[2019-01-28T14:09:13,863][INFO ][o.e.c.s.ClusterService   ] [lJgJwYh] removed {{yRj9xlW}{yRj9xlW8Q-WgQ0j6H4DbOQ}{SEn_C0KDTbSvoD580O9KUg}{10.231.83.238}{10.231.83.238:9300},}, reason: zen-disco-node-failed({yRj9xlW}{yRj9xlW8Q-WgQ0j6H4DbOQ}{SEn_C0KDTbSvoD580O9KUg}{10.231.83.238}{10.231.83.238:9300}), reason(transport disconnected)

Assuming that the clocks are correct and the old master went down at exactly 14:09:00, it took 8.5 seconds to detect that it was disconnected, then 3 seconds to perform another election, and then another 2 seconds to remove the old master and become healthy.

The 3-second election is expected in 6.x, but we've been working on this recently and it will be quicker in 7.0. It's provably impossible to guarantee how long an election takes, but we typically see them in the ~100ms range in 7.0.

The 2 seconds to remove the old master makes sense, given that it makes two connection attempts to it, each of which times out after transport.tcp.connect_timeout: 1s. We would like to streamline this (see e.g. #29025) but do not know when that work will take place yet.

The 8.5 seconds to detect the disconnection seems surprisingly long with net.ipv4.tcp_retries2=3. I would expect it to take less than 2 seconds: pings are sent every second, and the first 3 retries should take well under a second. Is this a cluster you are setting up specifically for this test and which has no other traffic? If so, can you collect packet captures from the two healthy nodes using tcpdump and share them here?
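Something along these lines should do on each of the two healthy nodes, assuming eth0 is the right interface there (adjust the interface and output file name as needed):

tcpdump -i eth0 -w node-transport.pcap 'tcp port 9300 or tcp port 9301 or tcp port 9302'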


It seems my expectations aren't right. I tried this on a test network and saw it perform more retries than the 3 it's configured for. The link went down at 14:26:40.560, and here's the traffic on the fault-detection connection between the old master (10.10.10.102) and the new master (10.10.10.103):

14:26:35.885276 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [S], seq 1219331720, win 29200, options [mss 1460,sackOK,TS val 2126002432 ecr 0,nop,wscale 7], length 0
14:26:35.885315 IP 10.10.10.102.9300 > 10.10.10.103.37078: Flags [S.], seq 3641735812, ack 1219331721, win 28960, options [mss 1460,sackOK,TS val 1729357237 ecr 2126002432,nop,wscale 7], length 0
14:26:35.885331 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [.], ack 1, win 229, options [nop,nop,TS val 2126002432 ecr 1729357237], length 0
14:26:39.124588 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 1:361, ack 1, win 229, options [nop,nop,TS val 2126005672 ecr 1729357237], length 360
14:26:39.124650 IP 10.10.10.102.9300 > 10.10.10.103.37078: Flags [.], ack 361, win 235, options [nop,nop,TS val 1729360477 ecr 2126005672], length 0
14:26:39.127494 IP 10.10.10.102.9300 > 10.10.10.103.37078: Flags [P.], seq 1:22, ack 361, win 235, options [nop,nop,TS val 1729360479 ecr 2126005672], length 21
14:26:39.127533 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [.], ack 22, win 229, options [nop,nop,TS val 2126005675 ecr 1729360479], length 0
14:26:40.129585 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 361:721, ack 22, win 229, options [nop,nop,TS val 2126006677 ecr 1729360479], length 360
14:26:40.130171 IP 10.10.10.102.9300 > 10.10.10.103.37078: Flags [P.], seq 22:43, ack 721, win 243, options [nop,nop,TS val 1729361482 ecr 2126006677], length 21
14:26:40.130192 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [.], ack 43, win 229, options [nop,nop,TS val 2126006677 ecr 1729361482], length 0
14:26:41.131989 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126007679 ecr 1729361482], length 360
14:26:41.337080 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126007884 ecr 1729361482], length 360
14:26:41.545085 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126008092 ecr 1729361482], length 360
14:26:41.969170 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126008516 ecr 1729361482], length 360
14:26:42.801172 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126009348 ecr 1729361482], length 360
14:26:44.465178 IP 10.10.10.103.37078 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2126011012 ecr 1729361482], length 360

The ping sent at 14:26:41.131989 is retried five times, backing off exponentially, before the new master finally detects that the old one has gone away at 14:26:47.141 and starts the election process:

[2019-01-28T14:26:47,141][INFO ][o.e.d.z.ZenDiscovery     ] [master-3] master_left [{master-2}{40Ve6BHKR9Ggl_anEwQqww}{HrGHl0FUQvuZTv4wEMvaYw}{10.10.10.102}{10.10.10.102:9300}{xpack.installed=true}], reason [failed to ping, tried [3] times, each with  maximum [2s] timeout]

I expected it to notice that the node was unavailable after the third retry, at 14:26:42.801172, but it spends another 5 seconds on another two retries first. I don't have any great ideas for improving this.
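For what it's worth, the backoff is easy to see by diffing the timestamps of the six seq 721:1081 packets above. A quick sketch, with just those lines saved to a file (retries.txt is an arbitrary name):

# print the gap in seconds between successive packets
awk '{ split($1, t, ":"); s = t[1]*3600 + t[2]*60 + t[3]; if (p) printf "%.3f\n", s - p; p = s }' retries.txt

That gives gaps of roughly 0.21s, 0.21s, 0.42s, 0.83s and 1.66s, i.e. doubling after the first retransmission.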


I'd like to dig into this statement a bit more. The cluster wasn't responding to health requests for those 13 seconds, but for the first 8 of those seconds I think it would have continued to accept indexing traffic and serve searches - it only stops doing so once the failure of the master is detected. Can you verify that this is the case?


Thank you very much for the interesting and detailed response, David!

First of all, our project will be going on for a while, so I'm glad to hear about version 7.0 (beta), which we'll test as soon as possible, although that won't happen until after February. I'll gladly share the results here.

Just to be sure I'm understanding this correctly: we configured tcp_retries2 to be 3, yet the TCP stack decides to retransmit 5 times? That's odd, especially given the following Linux kernel doc https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt:

tcp_retries2 - INTEGER
   This value influences the timeout of an alive TCP connection,
   when RTO retransmissions remain unacknowledged.
   Given a value of N, a hypothetical TCP connection following
   exponential backoff with an initial RTO of TCP_RTO_MIN would
   retransmit N times before killing the connection at the (N+1)th RTO.

Could the extra 2 be coming from tcp_retries1?

I must say I'm confused. If this is vanilla Linux TCP behaviour, I've learned something new today.
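For reference, the current values can be checked on each node with the one-liner below (as far as I know the defaults are 3 for tcp_retries1 and 15 for tcp_retries2):

sysctl net.ipv4.tcp_retries1 net.ipv4.tcp_retries2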

Unfortunately, I'm not able to do further testing for the next two+ weeks, as I'm not in the office where the test lab is located. If you want/need the results earlier, I can ask a colleague to get them for you. Otherwise I'll let you know in 2-3 weeks when I get back to recovery testing.

The docs are not fantastic, but I don't think tcp_retries1 has any effect here. I tried setting it to 1 (down from its default of 3) and still saw 5 retries as below. The network dropped at 2019-01-29T09:50:19.0.

09:50:12.201299 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [S], seq 184377679, win 29200, options [mss 1460,sackOK,TS val 2072892255 ecr 0,nop,wscale 7], length 0
09:50:12.201346 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [S.], seq 368511452, ack 184377680, win 28960, options [mss 1460,sackOK,TS val 4088184297 ecr 2072892255,nop,wscale 7], length 0
09:50:12.201364 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [.], ack 1, win 229, options [nop,nop,TS val 2072892255 ecr 4088184297], length 0
09:50:15.358581 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1:361, ack 1, win 229, options [nop,nop,TS val 2072895412 ecr 4088184297], length 360
09:50:15.358653 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [.], ack 361, win 235, options [nop,nop,TS val 4088187454 ecr 2072895412], length 0
09:50:15.362297 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [P.], seq 1:22, ack 361, win 235, options [nop,nop,TS val 4088187458 ecr 2072895412], length 21
09:50:15.362354 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [.], ack 22, win 229, options [nop,nop,TS val 2072895416 ecr 4088187458], length 0
09:50:16.365532 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 361:721, ack 22, win 229, options [nop,nop,TS val 2072896419 ecr 4088187458], length 360
09:50:16.367020 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [P.], seq 22:43, ack 721, win 243, options [nop,nop,TS val 4088188463 ecr 2072896419], length 21
09:50:16.367081 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [.], ack 43, win 229, options [nop,nop,TS val 2072896421 ecr 4088188463], length 0
09:50:17.369619 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 721:1081, ack 43, win 229, options [nop,nop,TS val 2072897423 ecr 4088188463], length 360
09:50:17.371116 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [P.], seq 43:64, ack 1081, win 252, options [nop,nop,TS val 4088189467 ecr 2072897423], length 21
09:50:17.371167 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [.], ack 64, win 229, options [nop,nop,TS val 2072897425 ecr 4088189467], length 0
09:50:18.372393 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1081:1441, ack 64, win 229, options [nop,nop,TS val 2072898426 ecr 4088189467], length 360
09:50:18.372925 IP 10.10.10.102.9300 > 10.10.10.101.49310: Flags [P.], seq 64:85, ack 1441, win 260, options [nop,nop,TS val 4088190469 ecr 2072898426], length 21
09:50:18.372945 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [.], ack 85, win 229, options [nop,nop,TS val 2072898427 ecr 4088190469], length 0
09:50:19.374692 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072899429 ecr 4088190469], length 360
09:50:19.581181 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072899635 ecr 4088190469], length 360
09:50:19.789186 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072899843 ecr 4088190469], length 360
09:50:20.213196 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072900267 ecr 4088190469], length 360
09:50:21.041114 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072901095 ecr 4088190469], length 360
09:50:22.705160 IP 10.10.10.101.49310 > 10.10.10.102.9300: Flags [P.], seq 1441:1801, ack 85, win 229, options [nop,nop,TS val 2072902759 ecr 4088190469], length 360

As far as I know there's nothing strange about my test machine; here's its uname -a:

Linux worker-772973 4.15.0-34-generic #37~16.04.1-Ubuntu SMP Tue Aug 28 10:44:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I am running these Elasticsearch nodes within Docker containers, all on the same host. I do not think this should make a difference but it would be good if you could validate this on your setup.
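For completeness: depending on the kernel version, some net.ipv4.* settings are per network namespace, so it may be worth confirming that the value seen inside the containers matches the one on the host (the container name below is just a placeholder):

# on the host:
sysctl net.ipv4.tcp_retries2

# inside a container:
docker exec es-node-1 cat /proc/sys/net/ipv4/tcp_retries2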

On both CentOS 7 with kernel 3.10.0 and Debian Stretch with kernel 4.9.0 I'm seeing the same results as you: 5 retries are sent even though tcp_retries2 is set to 3. I've also tried setting tcp_retries1 to 1, with the same result.

I'm now thinking about setting tcp_retries2 to 1, thereby effectively reducing the retries to 3.

Edit: Reducing tcp_retries2 to 1 results in faster recovery, down from 13s to 8s. I'm not sure there won't be side effects though, so I'd better run some longer-term tests.

I did a quick test with 7.0.0-beta1; it looks like the whole recovery now takes about 8s with net.ipv4.tcp_retries2=3. Much quicker than the previous 13s!

