Cluster recovery and reachability takes long time when master left

DavidTurner · January 25, 2019, 2:34pm

This sounds similar to the discussion here:

Reduce net.ipv4.tcp_retries2 and the connection timeout and you should see improvements. Even 10-20 seconds sounds like a long time for the cluster to recover once those settings are set appropriately, and I'd be interested to see logs from a recovery that did take that long.

Edit: it looks like you have reduced the ping timeout to 4s in Elasticsearch. That will help it detect the connection drop a little more quickly, but it can harm your cluster stability, since it will remove nodes from the cluster if they pause for a few seconds during GC. It's much better to detect the connection drop with net.ipv4.tcp_retries2, since this is independent of GC.
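
To give a rough sense of why net.ipv4.tcp_retries2 matters so much here, the sketch below estimates how long the kernel keeps retransmitting on a dead connection before giving up. It is only an approximation under stated assumptions: the retransmission timeout starts at the Linux minimum of 200ms, doubles on each retry, and is capped at 120s. The real figure depends on the measured round-trip time, so treat these numbers as a lower bound. The function name `hypothetical_timeout` is made up for this illustration.

```python
# Assumed Linux constants: TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s.
RTO_MIN_SECONDS = 0.2
RTO_MAX_SECONDS = 120.0


def hypothetical_timeout(retries2: int) -> float:
    """Estimate the time before TCP gives up on an unacknowledged segment.

    Sums the initial RTO plus one doubled (and capped) RTO for each of the
    `retries2` retransmissions, i.e. retries2 + 1 waits in total.
    """
    return sum(
        min(RTO_MIN_SECONDS * 2 ** attempt, RTO_MAX_SECONDS)
        for attempt in range(retries2 + 1)
    )


if __name__ == "__main__":
    # 15 is the usual Linux default; the smaller values are only examples.
    for retries2 in (15, 8, 5, 3):
        print(f"tcp_retries2={retries2}: ~{hypothetical_timeout(retries2):.1f}s")
```

With the default of 15 this works out to roughly 924.6 seconds, which is why a dropped connection can go unnoticed for a very long time. You can change the value at runtime with `sysctl -w net.ipv4.tcp_retries2=<n>` and persist it in /etc/sysctl.conf.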
