Shard rebalancing is slow after network failure on any node

We are using Elasticsearch 6.3.1 on a 3-node cluster. Each machine has a 1 TB SSD and 30 GB of RAM, with 8 GB allocated to ES, and minimum master nodes is set to 2. In our testing, if the network on one of the ES nodes goes down, Elasticsearch takes about 7 to 10 minutes to become responsive again, and the indices take that long to turn green again. However, if we just stop the ES service on one of the machines, rebalancing does not take that long. Has anyone else faced this issue?

These are our settings: icecluster icecluster-
node.master: true true
node.ingest: false /data
path.logs: /var/log/elasticsearch
http.port: 9200 [,,]
discovery.zen.minimum_master_nodes: 2
queue_size: 1000000
queue_size: 1000000

Why do you have such large queue sizes? How many indices and shards do you have in the cluster?

We have 73 indices and about 330 shards in total in the cluster. We have real-time data that we query at very high frequency. Anyway, I've tried dropping the queue size to the default and we see the same issue.

Also, the issue is not only that the status does not become green, but that data is inaccessible for almost 10 minutes. However, if you simply run service elasticsearch stop on one of the nodes, it behaves just fine. The issue occurs only if the network cable is unplugged on one node or the interface is shut down.

This could be a network configuration issue: it sounds like your network isn't reporting a failure fast enough. What do these commands output?

$ cat /proc/sys/net/ipv4/tcp_retries2
$ cat /proc/sys/net/ipv4/tcp_syn_retries

Also it looks like transport.tcp.connect_timeout is set to the default, which is a rather conservative 30 seconds. This applies to all outbound connections that Elasticsearch tries to make, including node-to-node connections. You might want to reduce this.
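For comparing the two kernel values across all three nodes, a small helper along these lines might be handy. This is only a sketch: the function name show_tcp_retry_settings and its optional directory argument are made up for illustration, and it does nothing beyond reading the two files named above.

```shell
# Print the two kernel settings asked about above, one per line.
# The optional first argument (an assumption, mainly useful for testing)
# overrides the directory to read from; it defaults to /proc/sys/net/ipv4.
show_tcp_retry_settings() {
    local dir="${1:-/proc/sys/net/ipv4}"
    local name
    for name in tcp_retries2 tcp_syn_retries; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=unreadable\n' "$name"
        fi
    done
}
```

Calling show_tcp_retry_settings with no arguments on each node prints both values in a form that is easy to paste into a reply.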

Possibly this is related to #29025.


Its cat /proc/sys/net/ipv4/tcp_retries2

cat /proc/sys/net/ipv4/tcp_syn_retries
Yes, transport.tcp.connect_timeout is 30s.


In the logs of the active nodes I see a message almost instantly, which suggests the network failure was in fact recognized.

[2019-01-18T11:29:08,672][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [icecluster-] failed to execute on node [tXI9LBseSJGrjQo87eJ4lw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [icecluster-][][cluster:monitor/nodes/stats[n]] request_id [179994] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$ [elasticsearch-6.3.1.jar:6.3.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ [elasticsearch-6.3.1.jar:6.3.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$ [?:1.8.0_191]
at [?:1.8.0_191]
[2019-01-18T11:30:04,445][INFO ][o.e.c.r.a.AllocationService] [icecluster-] Cluster health status changed from [GREEN] to [YELLOW] (reason: ).
[2019-01-18T11:30:04,445][INFO ][o.e.c.s.MasterService ] [icecluster-] zen-disco-node-failed({icecluster-}{tXI9LBseSJGrjQo87eJ4lw}{74RGePK5S-CdCz3U35p2ig}{}{}{ml.machine_memory=33702772736, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout), reason: removed {{icecluster-}{tXI9LBseSJGrjQo87eJ4lw}{74RGePK5S-CdCz3U35p2ig}{}{}{ml.machine_memory=33702772736, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}

Actually, it's very easy to replicate: pull out the network cable on one node of a 3-node cluster and simply try to retrieve data, or look at http://$ip:9200/_cat/indices. You will see ES hang.

I think these settings are too high. In particular if /proc/sys/net/ipv4/tcp_retries2 is 15 then it will take well over a minute to detect a dropped connection, during which time all sorts of other requests will be piling up in queues and generally causing trouble. If you reduce this setting to something more reasonable (Red Hat say to reduce it to 3 in a HA situation) then the initial connection failure will be picked up much quicker.
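To get a feel for how tcp_retries2 translates into time, here is a rough estimate of the lower bound on how long the kernel keeps retransmitting before it drops an established connection. It is a sketch under stated assumptions: the retransmission timeout starts at 200 ms, doubles on every retry, and is capped at 120 s. The real timeout depends on the measured round-trip time, so treat the result as a floor rather than a prediction. The function name estimate_tcp_timeout_ms is made up for illustration.

```shell
# Hypothetical lower bound, in milliseconds, on the time Linux spends
# retransmitting on an established connection before giving up.
# Assumptions: the timeout starts at 200 ms, doubles per retry,
# and is capped at 120000 ms (120 s).
estimate_tcp_timeout_ms() {
    local retries="$1"
    local rto=200
    local total=0
    local i=0
    # One initial timeout plus one per retransmission: retries + 1 terms.
    while [ "$i" -le "$retries" ]; do
        total=$((total + rto))
        rto=$((rto * 2))
        if [ "$rto" -gt 120000 ]; then
            rto=120000
        fi
        i=$((i + 1))
    done
    echo "$total"
}

estimate_tcp_timeout_ms 15   # prints 924600 (about 15 minutes)
estimate_tcp_timeout_ms 3    # prints 3000 (3 seconds)
```

Under these assumptions the default of 15 works out to roughly 924.6 seconds, which is the figure the Linux kernel documentation gives for tcp_retries2, while a value of 3 gives about 3 seconds.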

That should be enough for cases where you disconnect a node that isn't the elected master. However if you disconnect the master node then a new master will be elected, and this initial election involves trying to reconnect to the disconnected node, which times out after transport.tcp.connect_timeout, and this happens twice, so with your settings that election takes at least another minute. I think that 30s for the connect timeout is too long for many situations. It's certainly far too long for node-to-node connections, but unfortunately today Elasticsearch doesn't allow setting different timeouts for different kinds of connection so you can only change it for every outbound connection.

Could you reduce these settings to something more appropriate and re-run your experiment? If it is still taking longer than you expect then it would be useful if you could share the full logs from the master node for the duration of the outage so we can start to look at what else is taking so long.

The message you quote indicates that a single stats request timed out, but Elasticsearch cannot tell if this is because of a network issue or because the node was busy (e.g. doing GC) so it doesn't trigger any further actions. The most reliable way to get the cluster to react to a network partition is to drop a connection, and reducing tcp_retries2 is a good way to do that.


It's a night-and-day difference: it worked, and it's quick.

Thank you for your help.

I ran sysctl -w net.ipv4.tcp_retries2=3 and set transport.tcp.connect_timeout to 5s, and boom! It worked, no more issues.
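One thing worth noting for anyone copying this: sysctl -w only changes the running kernel, so the value is lost on reboot. A minimal sketch of persisting it with a sysctl drop-in file follows. The function name write_tcp_retries_conf, the file name 99-tcp-retries.conf, and the directory argument are illustrative choices, not something from this thread.

```shell
# Write a sysctl drop-in so the tcp_retries2 value survives a reboot.
# Arguments: the value to set, and optionally the target directory
# (defaults to /etc/sysctl.d). The file name is a hypothetical choice.
write_tcp_retries_conf() {
    local value="$1"
    local dir="${2:-/etc/sysctl.d}"
    local file="$dir/99-tcp-retries.conf"
    printf 'net.ipv4.tcp_retries2 = %s\n' "$value" > "$file" || return 1
    # Print the path that was written so the caller can inspect it.
    echo "$file"
}

# Usage (as root):
#   write_tcp_retries_conf 3
#   sysctl --system    # reload all sysctl configuration files
```

The transport.tcp.connect_timeout: 5s line goes in elasticsearch.yml on each node; as far as I know it is read at startup, so the nodes need a restart to pick it up.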

