I've recently been having trouble with a 20 node ES cluster where every
other night one or more nodes will drop out of the cluster due to failed
pings. Typically it's one or two nodes, always the same one or two, and
roughly at the same time of night. At the time these nodes are dropped, the
master seems to be doing a lot of merges and, presumably related, it has
very high disk IO (close to 100%). The nodes that are dropped also tend to
be busy with IO but not nearly as much as the master.
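Since the merges seem to be saturating disk when this happens, one thing I've been considering is capping merge I/O with store throttling. A sketch of what I'd put in elasticsearch.yml (setting names as I understand them from the 1.x docs, and the rate is just a guess for our disks, so please correct me if I have these wrong):

```yaml
# Throttle only merge I/O, not regular indexing/search traffic.
indices.store.throttle.type: merge
# 20mb/s is an illustrative value, not a recommendation.
indices.store.throttle.max_bytes_per_sec: 20mb
```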
It's also worth noting that the master logs show the debug output "using
[concurrent] merge scheduler with max_thread_count" on the nights when a
problem occurs, at roughly the time of the failures. It's not clear to me
what that message indicates or why it's logged at that moment, since
presumably the same scheduler has been responsible for every prior merge
that day. Is it being reset? Is it starting some kind of larger merge?
Seems very coincidental.
So, any thoughts on what might be going wrong or how to address it? I
could just extend the failure detection ping timeouts, but that seems like
it's hiding the symptom without tracking down the cause.
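For reference, the timeout change I have in mind is roughly the following (these are the zen discovery fault-detection settings as I understand them; I believe the defaults are 1s interval, 30s timeout, 3 retries, but I may be misremembering):

```yaml
# Fault-detection pings between master and nodes (zen discovery).
discovery.zen.fd.ping_interval: 1s
# Doubling timeout and retries from the defaults -- but this just
# masks the IO stalls rather than fixing whatever causes them.
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
```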