Master node discovery is not working

Hello all,

Here's another problem with our cluster that I cannot understand. When the master node (node1) drops from the cluster, the other nodes continue trying to use node1 as master. For some reason master re-election is not taking place. And when node1 is restarted the cluster does not heal; the other nodes continue logging error messages such as "failed to send health to master node node1, node not connected".
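To confirm that the nodes really disagree about the master, one option is to ask each node individually which master it currently sees. Below is a minimal sketch, assuming each node's HTTP port is reachable on 9200 and using the `_cat/master` endpoint with `format=json`. The function names and the host names you pass in are hypothetical, and a node without a master may answer with an HTTP error, which is reported here as an error string.

```python
import json
from urllib.request import urlopen


def master_seen_by(host, port=9200, timeout=5):
    """Return the master node name this host reports, None if the
    response is empty, or an 'error: ...' string if the call fails."""
    url = f"http://{host}:{port}/_cat/master?format=json"
    try:
        with urlopen(url, timeout=timeout) as resp:
            rows = json.load(resp)
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers a body that is not valid JSON.
        return f"error: {exc}"
    return rows[0].get("node") if rows else None


def disagreeing(views):
    """True if the hosts do not all report the same master.

    `views` maps host name -> result of master_seen_by().
    """
    return len(set(views.values())) > 1
```

For example, `views = {h: master_seen_by(h) for h in ["node2", "node3", "node6"]}` followed by `disagreeing(views)` would show whether any node is stuck on a stale master or sees none at all.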

This started because one of our data nodes (node6) ran out of memory and the ES process was killed. (The OOM was likely due to too many recoveries occurring while indexing.) After node6 was restarted, it failed to discover the master. Additionally, any queries to the rest of the cluster failed. I decided to restart the active master node (node1) to trigger rediscovery, but this did not work as expected. Now it seems that, once again, a full cluster restart is the only option to recover.

I'm not sure what is wrong. The config is fine and has worked for ages. iptables allows all cluster traffic. There are 8 nodes: 6 are data nodes and 2 are ingest only. 3 of the data nodes are master-eligible. Minimum master nodes (discovery.zen.minimum_master_nodes) is set to 2. Zen discovery is using unicast.
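As a sanity check on the quorum setting, the usual rule for Zen discovery is (master-eligible nodes / 2) + 1, rounded down. A small sketch of that arithmetic follows; the function name is only illustrative.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - minimum_master_nodes(master_eligible)
```

With 3 master-eligible nodes this gives a quorum of 2 and tolerates the loss of 1, which matches the value configured here. So losing node1 alone should still have left enough master-eligible nodes to elect a new master.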

This is Elasticsearch 5.6.3.

I couldn't figure this out and assumed it might have been a bug triggered by the unsafe shutdown (the OOM kill).

It seemed a good time to upgrade to 6.1.3 so I've done that. This thread is probably no longer relevant due to the upgrade.
