Shard rebalancing is slow after network failure on any node

I think these settings are too high. In particular, if /proc/sys/net/ipv4/tcp_retries2 is 15 then it will take well over a minute to detect a dropped connection, during which time all sorts of other requests will be piling up in queues and generally causing trouble. If you reduce this setting to something more reasonable (Red Hat say to reduce it to 3 in an HA situation) then the initial connection failure will be picked up much more quickly.
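To see why the default is so slow, here is a back-of-the-envelope model of the kernel's exponential retransmission backoff. The 200ms initial RTO and the 120s cap are assumptions based on the usual Linux constants (TCP_RTO_MIN and TCP_RTO_MAX); the real initial RTO depends on the measured round-trip time, so treat this as a sketch rather than an exact figure:

```python
# Rough model of Linux TCP retransmission backoff: the retransmission
# timeout (RTO) starts near TCP_RTO_MIN and doubles after each
# unacknowledged retransmission, capped at TCP_RTO_MAX.
RTO_MIN = 0.2    # seconds -- assumed initial RTO
RTO_MAX = 120.0  # seconds -- TCP_RTO_MAX on stock Linux kernels

def time_to_give_up(tcp_retries2: int) -> float:
    """Approximate seconds before the kernel abandons a connection
    whose peer has silently vanished: one initial timeout plus
    tcp_retries2 retransmission timeouts, each double the last."""
    return sum(min(RTO_MIN * 2 ** i, RTO_MAX) for i in range(tcp_retries2 + 1))

print(round(time_to_give_up(15), 1))  # default of 15: ~924.6s, over 15 minutes
print(round(time_to_give_up(3), 1))   # reduced to 3: ~3.0s
```

So the default is not just "over a minute" but potentially a quarter of an hour of silence before the kernel gives up on the dead peer.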

That should be enough for cases where you disconnect a node that isn't the elected master. However, if you disconnect the master node then a new master will be elected, and this initial election involves trying to reconnect to the disconnected node, which times out after transport.tcp.connect_timeout, and this happens twice, so with your settings that election takes at least another minute. I think that 30s for the connect timeout is too long for many situations. It's certainly far too long for node-to-node connections, but unfortunately today Elasticsearch doesn't allow setting different timeouts for different kinds of connection, so you can only change it for every outbound connection.
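Concretely, that would be a change like the following in elasticsearch.yml on every node (the 5s value is purely illustrative, not a recommendation; pick something suited to your network):

```yaml
# elasticsearch.yml -- applies to ALL outbound connections, not just
# node-to-node ones, since per-connection-type timeouts aren't supported
transport.tcp.connect_timeout: 5s
```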

Could you reduce these settings to something more appropriate and re-run your experiment? If it is still taking longer than you expect then it would be useful if you could share the full logs from the master node for the duration of the outage so we can start to look at what else is taking so long.

The message you quote indicates that a single stats request timed out, but Elasticsearch cannot tell whether this is because of a network issue or because the node was busy (e.g. doing GC), so it doesn't trigger any further action. The most reliable way to get the cluster to react to a network partition is to drop a connection, and reducing tcp_retries2 is a good way to do that.
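On most Linux distributions, reducing it looks something like this (a sketch; it needs root, must be applied on every node, and the drop-in filename is just an example):

```shell
# Apply immediately; 3 follows the Red Hat HA guidance mentioned above
sysctl -w net.ipv4.tcp_retries2=3

# Persist across reboots via a sysctl drop-in (filename is illustrative)
echo 'net.ipv4.tcp_retries2 = 3' > /etc/sysctl.d/90-tcp-retries2.conf
```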
