This sounds similar to the discussion here:
Reduce net.ipv4.tcp_retries2 and the connection timeout, and you should see improvements. With those settings set appropriately, even 10-20 seconds sounds like a long time for the cluster to recover, and I'd be interested to see logs from a recovery that did take that long.
Edit: it looks like you have reduced the ping timeout to 4s in Elasticsearch, which will help it detect the connection drop a little quicker but can harm your cluster stability, since it will remove nodes from the cluster if they pause for a few seconds of GC. It's much better to detect the connection drop with net.ipv4.tcp_retries2, since this is independent of GC.
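
To give a rough feel for why lowering net.ipv4.tcp_retries2 matters, here is a small sketch that estimates how long Linux keeps retransmitting on a dead connection before giving up. It assumes the retransmission timeout starts at the kernel minimum of 200ms and doubles on each retry up to a 120s cap; in practice the starting value depends on the measured round-trip time, so treat the numbers as a lower bound rather than an exact figure. The function name and constants are just for illustration.

```python
# Rough estimate of how long Linux retransmits before dropping a dead
# TCP connection, for a given net.ipv4.tcp_retries2 value.
# Assumptions: RTO starts at the 200ms minimum and doubles on each
# retransmission, capped at 120s. Real connections start from an RTO
# derived from the measured RTT, so actual timeouts can be longer.

RTO_MIN_SECONDS = 0.2    # assumed initial retransmission timeout
RTO_MAX_SECONDS = 120.0  # assumed upper bound on the backoff


def hypothetical_timeout(retries2: int,
                         rto_min: float = RTO_MIN_SECONDS,
                         rto_max: float = RTO_MAX_SECONDS) -> float:
    """Sum the waits for the original send plus `retries2` retransmissions."""
    total = 0.0
    rto = rto_min
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)  # exponential backoff with a cap
    return total


if __name__ == "__main__":
    for retries in (15, 8, 5, 3):
        print(f"tcp_retries2={retries}: ~{hypothetical_timeout(retries):.1f}s")
```

Under these assumptions the common default of 15 works out to roughly 924.6 seconds, while 5 gives about 12.6 seconds. Unlike the ping timeout, this timer runs in the kernel, so a long GC pause in the JVM doesn't affect it.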