Cluster High Availability

We had an incident today with Elasticsearch in which high availability failed completely for 4 minutes when one of the nodes died. This caused significant problems on our end, because all search queries failed for any index whose primary shard was on the failed node.

The cluster has 4 data nodes and 2 query nodes.
discovery.zen.minimum_master_nodes is set to 2, and all nodes have node.master: true.

All shards have 1 replica, so when 1 data node died, why did all search queries fail for any indexes where the primary shard was on the affected node?

Do you mean you have 6 master-eligible nodes? If so, you must set discovery.zen.minimum_master_nodes: 4. Your cluster is at serious risk of data loss and all sorts of other weird issues if you set it too low.

The difference between primary and replica is not relevant to searches; moreover one replica will immediately be promoted to primary when the node holding the current primary leaves the cluster.

Apart from that, there's not much to go on here. Can you share more details?

Thank you David.

I can share more details, just let me know what you require.

Also, I just spoke to Max Bashlawi, and we are now looking at bringing someone in to review our infrastructure and config, as we feel it needs more careful tuning.

I just read the manual and found what you are referring to: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/discovery-settings.html#minimum_master_nodes

(master_eligible_nodes / 2) + 1

I will revise our clusters immediately.
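
For anyone following along, here is a minimal sketch of the arithmetic in Python. It assumes the division in the formula is integer division, and the function names `minimum_master_nodes` and `max_independent_masters` are made up for illustration; they are not part of any Elasticsearch API. The second function is only a rough upper bound on how many disjoint groups of nodes could each reach the configured minimum, which is why a value of 2 is risky with 6 master-eligible nodes.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum from the formula (master_eligible_nodes / 2) + 1.

    Assumption: the division is integer (floor) division.
    """
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def max_independent_masters(master_eligible: int, configured_minimum: int) -> int:
    """Upper bound on disjoint node groups that could each reach the configured minimum.

    Any value above 1 means a network partition could leave more than one
    group able to elect its own master (hypothetical helper, for illustration).
    """
    if configured_minimum < 1:
        raise ValueError("configured minimum must be at least 1")
    return master_eligible // configured_minimum


if __name__ == "__main__":
    nodes = 6  # 4 data nodes + 2 query nodes, all master-eligible
    print(minimum_master_nodes(nodes))        # 4
    print(max_independent_masters(nodes, 2))  # 3
    print(max_independent_masters(nodes, 4))  # 1
```

With the recommended value of 4, at most one group of nodes can ever hold enough master-eligible nodes to elect a master.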

Sorry, I forgot to add a link to the relevant manual page, but yes, the one you linked is the right one.

I'm sure the consulting team will do a fine job, and it's probably best for me to leave further investigation to them. Their job will be a good deal easier if you can share with them all the relevant logs and any other evidence you can gather from the time of your outage.
