Hello all,
Here's another problem with our cluster that I cannot understand. When a node (node1) drops out of the cluster, the other nodes keep trying to use node1 as master. For some reason master re-election does not take place. And when node1 is restarted the cluster does not heal: the other nodes keep logging errors such as "failed to send health to master node node1, node not connected".
This started when one of our data nodes (node6) ran out of memory, which killed the ES process. (The OOM was likely due to too many recoveries occurring while indexing.) After node6 was restarted, it failed to discover the master. Additionally, any queries to the rest of the cluster failed. I decided to restart the active master node (node1) to trigger rediscovery, but this did not work as expected. It now seems that, once again, a full cluster restart is the only way to recover.
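In case it helps with diagnosis, here is a rough sketch of how each node's view of the master could be compared. It assumes the HTTP API is reachable on the default port 9200 and that `GET /_cat/master?format=json` returns a single row with a `node` field. The function names and the host list are placeholders of mine, not anything from our actual setup.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def reported_master(host, port=9200, timeout=5.0):
    """Ask one node which master it currently sees; None if it cannot say."""
    url = f"http://{host}:{port}/_cat/master?format=json"
    try:
        with urlopen(url, timeout=timeout) as resp:
            rows = json.load(resp)
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return None
    if isinstance(rows, list) and rows:
        return rows[0].get("node")
    return None


def group_by_master(views):
    """Group node names by the master each one reports.

    `views` maps node name -> reported master name (or None).
    """
    groups = defaultdict(list)
    for node, master in views.items():
        groups[master].append(node)
    return {master: sorted(nodes) for master, nodes in groups.items()}


# Hypothetical usage (host names are placeholders):
# hosts = ["node2", "node3", "node6"]
# print(group_by_master({h: reported_master(h) for h in hosts}))
```

If the nodes fall into more than one group, or some report no master at all, that would at least show which nodes are still holding on to the old master.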
I'm not sure what is wrong. The config looks fine and has worked for ages. iptables allows all cluster traffic. There are 8 nodes: 6 data nodes and 2 ingest-only nodes. 3 of the data nodes are master-eligible. discovery.zen.minimum_master_nodes is set to 2. Zen discovery is using unicast.
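To spell out why I expected a re-election, here is the quorum arithmetic as I understand it, using the usual (master-eligible / 2) + 1 rule. This is only a sketch of the arithmetic, not of how Zen discovery is implemented, and the function names are mine.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def can_elect(visible_master_eligible: int, min_master_nodes: int) -> bool:
    """An election is possible only if enough master-eligible nodes see each other."""
    return visible_master_eligible >= min_master_nodes


quorum = minimum_master_nodes(3)   # 3 master-eligible data nodes
remaining = 3 - 1                  # node1 (the active master) is gone
print(quorum, can_elect(remaining, quorum))
```

With 3 master-eligible nodes the quorum is 2, so losing node1 alone should still leave enough nodes to elect a new master, provided the remaining two can reach each other.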
This is Elasticsearch 5.6.3.