We have a 7-node cluster: 3 dedicated master nodes and 4 data/ingest nodes. A host housing some of our VMs failed, which took one master node and two data nodes down with it. Our cluster was still "available," but we decided to fail over to our rescue cluster.
However, it took a long time to resolve this issue (around 5 minutes of search downtime on our site) because our ping timeout was too long. We had left it at the default of 30s, which meant the failed nodes were not detected for a while. We have since lowered it to 10s, but we're wondering whether, given our network capacity and cluster size, we could go even lower.
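For context, here is a rough back-of-the-envelope sketch of how the ping timeout might translate into detection time. It assumes the fault detector pings each node at a fixed interval and only declares a node dead after several consecutive timed-out pings. The parameter names and the 1s interval / 3 retries values are my assumptions (they are not confirmed from our config), and the function name is just illustrative.

```python
def worst_case_detection_seconds(ping_interval: float,
                                 ping_timeout: float,
                                 ping_retries: int) -> float:
    """Rough upper bound on the time needed to declare a dead node failed.

    Assumes every ping to the dead node waits out the full timeout
    (no fast "connection refused") and that the detector waits one
    interval before each attempt.
    """
    return ping_retries * (ping_interval + ping_timeout)


# Assumed values: 1s between pings, 3 attempts before giving up.
for timeout in (30, 10, 5):
    bound = worst_case_detection_seconds(1, timeout, 3)
    print(f"ping_timeout={timeout}s -> up to ~{bound}s to detect")
```

Under those assumptions the old 30s timeout gives roughly 93s before the failure is even noticed, 10s gives roughly 33s, and 5s would give roughly 18s. The trade-off I'm unsure about is how low we can go before a long GC pause or a brief network blip gets a healthy node dropped.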
Thanks!