Shard rebalancing is slow after network failure on any node

Ninad_Pradhan · January 17, 2019, 11:49pm

We are using Elasticsearch 6.3.1 on a 3-node cluster (1 TB SSD and 30 GB RAM per machine, 8 GB allocated to ES, minimum_master_nodes set to 2). In our testing, if the network on one of the ES nodes goes down, Elasticsearch takes about 7 to 10 minutes to become responsive again, and the indices take that long to turn green again. However, if we just stop the ES service on one of the machines, rebalancing does not take that long. Has anyone else faced this issue?

These are our settings:

cluster.name: icecluster
node.name: icecluster-10.10.20.11
node.master: true
node.data: true
node.ingest: false

path.data: /data
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: [10.10.20.11,10.10.20.12,10.10.20.13]
discovery.zen.minimum_master_nodes: 2
thread_pool:
  bulk:
    queue_size: 1000000
  index:
    queue_size: 1000000

Christian_Dahlqvist · January 18, 2019, 6:11am

Why do you have such large queue sizes? How many indices and shards do you have in the cluster?

Ninad_Pradhan · January 18, 2019, 6:18pm

73 indices and about 330 total shards in the cluster. We have real-time data that we query at very high frequency. Anyway, I've tried dropping the queue size to the default and it has the same issue.

Also, the issue is not only that the status does not become green, but that data is inaccessible for almost 10 minutes. However, if you simply run service elasticsearch stop on one of the nodes, it behaves just fine; the issue occurs only if the network cable is unplugged on one node or the interface is shut down.

DavidTurner · January 18, 2019, 6:56pm

This could be a network configuration issue: it sounds like your network isn't reporting a failure fast enough. What do these commands output?

$ cat /proc/sys/net/ipv4/tcp_retries2
$ cat /proc/sys/net/ipv4/tcp_syn_retries

Also it looks like transport.tcp.connect_timeout is set to the default, which is a rather conservative 30 seconds. This applies to all outbound connections that Elasticsearch tries to make, including node-to-node connections. You might want to reduce this.

Possibly this is related to #29025.

Ninad_Pradhan · January 18, 2019, 7:32pm

$ cat /proc/sys/net/ipv4/tcp_retries2
15
$ cat /proc/sys/net/ipv4/tcp_syn_retries
6

Yes, transport.tcp.connect_timeout is 30s.

However, in the logs of the active nodes I see a message almost immediately, which suggests the network failure was in fact recognized.

[2019-01-18T11:29:08,672][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [icecluster-10.10.20.12] failed to execute on node [tXI9LBseSJGrjQo87eJ4lw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [icecluster-10.10.20.11][10.10.20.11:9300][cluster:monitor/nodes/stats[n]] request_id [179994] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:987) [elasticsearch-6.3.1.jar:6.3.1]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:626) [elasticsearch-6.3.1.jar:6.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
[2019-01-18T11:30:04,445][INFO ][o.e.c.r.a.AllocationService] [icecluster-10.10.20.12] Cluster health status changed from [GREEN] to [YELLOW] (reason: ).
[2019-01-18T11:30:04,445][INFO ][o.e.c.s.MasterService ] [icecluster-10.10.20.12] zen-disco-node-failed({icecluster-10.10.20.11}{tXI9LBseSJGrjQo87eJ4lw}{74RGePK5S-CdCz3U35p2ig}{10.10.20.11}{10.10.20.11:9300}{ml.machine_memory=33702772736, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout), reason: removed {{icecluster-10.10.20.11}{tXI9LBseSJGrjQo87eJ4lw}{74RGePK5S-CdCz3U35p2ig}{10.10.20.11}{10.10.20.11:9300}{ml.machine_memory=33702772736, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}

Actually, it's very easy to replicate: pull out the network cable on one node of a 3-node cluster and try to retrieve data, or look at http://$ip:9200/_cat/indices, and you will see ES hang.

DavidTurner · January 19, 2019, 10:06am

I think these settings are too high. In particular if /proc/sys/net/ipv4/tcp_retries2 is 15 then it will take well over a minute to detect a dropped connection, during which time all sorts of other requests will be piling up in queues and generally causing trouble. If you reduce this setting to something more reasonable (Red Hat say to reduce it to 3 in a HA situation) then the initial connection failure will be picked up much quicker.
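To put rough numbers on this, here is a minimal sketch of how tcp_retries2 translates into a give-up time. It assumes the kernel's usual bounds on the retransmission timeout (200 ms minimum, 120 s maximum) and a timeout that doubles after every retry; the real figure depends on the measured round-trip time, so treat the result as a hypothetical lower bound. The function name is made up for illustration.

```shell
#!/bin/sh
# Estimate how long Linux keeps retransmitting on a dead connection
# before giving up, for a given net.ipv4.tcp_retries2 value.
# Assumptions: RTO starts at 200 ms, doubles per retry, is capped at 120 s.
tcp_retries2_timeout_ms() {   # hypothetical helper, not a system command
  retries=$1
  rto=200        # assumed minimum retransmission timeout, in ms
  total=0
  i=0
  while [ "$i" -le "$retries" ]; do
    total=$((total + rto))
    rto=$((rto * 2))
    if [ "$rto" -gt 120000 ]; then
      rto=120000   # assumed maximum retransmission timeout, in ms
    fi
    i=$((i + 1))
  done
  echo "$total"
}

tcp_retries2_timeout_ms 15   # prints 924600 (about 15 minutes)
tcp_retries2_timeout_ms 3    # prints 3000 (about 3 seconds)
```

Under these assumptions the default of 15 keeps a dead connection open for around 15 minutes, while 3 gives up after about 3 seconds.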

That should be enough for cases where you disconnect a node that isn't the elected master. However if you disconnect the master node then a new master will be elected, and this initial election involves trying to reconnect to the disconnected node, which times out after transport.tcp.connect_timeout, and this happens twice, so with your settings that election takes at least another minute. I think that 30s for the connect timeout is too long for many situations. It's certainly far too long for node-to-node connections, but unfortunately today Elasticsearch doesn't allow setting different timeouts for different kinds of connection so you can only change it for every outbound connection.

Could you reduce these settings to something more appropriate and re-run your experiment? If it is still taking longer than you expect then it would be useful if you could share the full logs from the master node for the duration of the outage so we can start to look at what else is taking so long.

The message you quote indicates that a single stats request timed out, but Elasticsearch cannot tell if this is because of a network issue or because the node was busy (e.g. doing GC) so it doesn't trigger any further actions. The most reliable way to get the cluster to react to a network partition is to drop a connection, and reducing tcp_retries2 is a good way to do that.

Ninad_Pradhan · January 22, 2019, 9:45pm

It's a night-and-day difference: it worked and it's quick.

Thank you for your help.

I set tcp_retries2 to 3 (sysctl -w net.ipv4.tcp_retries2=3) and transport.tcp.connect_timeout to 5s and boom, it worked. No more issues.
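Note that sysctl -w only changes the running kernel, so the value is lost on reboot. Below is a minimal sketch of persisting both changes. The file paths are assumptions (common package defaults, not taken from this thread) and the helper name is hypothetical; adjust both to your installation.

```shell
#!/bin/sh
# Sketch only. The paths are assumptions (common package defaults), not taken
# from this thread; override them through the environment if yours differ.
SYSCTL_CONF="${SYSCTL_CONF:-/etc/sysctl.d/99-tcp-retries2.conf}"
ES_YML="${ES_YML:-/etc/elasticsearch/elasticsearch.yml}"

persist_failover_tuning() {   # hypothetical helper name
  # Kernel side: same value as `sysctl -w net.ipv4.tcp_retries2=3`,
  # written to a file so it can be reapplied at boot.
  printf 'net.ipv4.tcp_retries2 = 3\n' > "$SYSCTL_CONF"

  # Elasticsearch side: append the timeout unless the key is already set.
  if ! grep -q '^transport\.tcp\.connect_timeout:' "$ES_YML" 2>/dev/null; then
    printf 'transport.tcp.connect_timeout: 5s\n' >> "$ES_YML"
  fi
}
```

Running persist_failover_tuning as root on each node, followed by sysctl --system and a restart of Elasticsearch, should apply the same values reported above. If the key already exists in elasticsearch.yml, the sketch leaves it alone, so an existing value would need to be edited by hand.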

