Cluster recovery and reachability take a long time when the master leaves

This sounds similar to the discussion here:

Reduce net.ipv4.tcp_retries2 and the connection timeout and you should see improvements. Even 10-20 seconds sounds like a long time for the cluster to recover with those settings set appropriately, and I'd be interested to see logs from a recovery that did take that long.

Edit: it looks like you have reduced the ping timeout to 4s in Elasticsearch, which will help it detect the connection drop a little quicker but can harm your cluster stability since it will remove nodes from the cluster if they pause for a few seconds of GC. It's much better to detect the connection drop with net.ipv4.tcp_retries2 since this is independent of GC.
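
To give a rough feel for why `net.ipv4.tcp_retries2` matters so much here, below is a simplified model of the kernel's retransmission backoff. It is only a sketch, not kernel code: it assumes the retransmission timeout starts at 200 ms, doubles after each unacknowledged retransmission, and is capped at 120 s, whereas the real timeout also depends on the measured round-trip time. The function name `hypothetical_timeout` is made up for illustration. With the default of 15 the model gives about 924.6 seconds, which matches the figure in the kernel documentation; smaller values bring it down to seconds rather than minutes.

```python
# Simplified model of Linux TCP retransmission backoff (an illustration, not kernel code).
RTO_MIN_S = 0.2    # assumed initial retransmission timeout (Linux TCP_RTO_MIN)
RTO_MAX_S = 120.0  # assumed cap on a single timeout (Linux TCP_RTO_MAX)


def hypothetical_timeout(tcp_retries2: int) -> float:
    """Approximate seconds before the kernel gives up on an unacknowledged connection."""
    total = 0.0
    rto = RTO_MIN_S
    # One wait before each retransmission, plus the final wait after the last one.
    for _ in range(tcp_retries2 + 1):
        total += rto
        rto = min(rto * 2, RTO_MAX_S)  # exponential backoff, capped
    return total


if __name__ == "__main__":
    for retries in (15, 8, 5):
        print(f"tcp_retries2={retries}: ~{hypothetical_timeout(retries):.1f}s")
```

Because this detection happens in the kernel, it keeps working even while the Elasticsearch JVM is paused for GC, which is why it is a safer knob than a very short ping timeout.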
