We had an incident with Elastic today: high availability failed completely for 4 minutes when one of the nodes died. This caused significant problems on our end, because all search queries failed for any index whose primary shard was on the failed node.
The cluster has 4 data nodes and 2 query nodes.
discovery.zen.minimum_master_nodes is set to 2 and all nodes have node.master: true.
All shards have 1 replica, so when 1 data node died, why did all search queries fail for any indexes where the primary shard was on the affected node?
Do you mean you have 6 master-eligible nodes? If so, you must set discovery.zen.minimum_master_nodes: 4, which is a strict majority of 6. Your cluster is at serious risk of data loss and all sorts of other weird issues if you set it too low.
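To illustrate where the 4 comes from, here is a minimal sketch of the usual quorum arithmetic (a strict majority of the master-eligible nodes). The function names are made up for this example and are not part of any Elasticsearch API; the second helper only checks whether a network partition could leave two groups of nodes that each satisfy the configured minimum.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_split_brain(master_eligible: int, configured: int) -> bool:
    """True if the nodes could split into two disjoint groups that
    each contain at least `configured` master-eligible nodes."""
    return 2 * configured <= master_eligible


if __name__ == "__main__":
    nodes = 6  # 4 data nodes + 2 query nodes, all with node.master: true
    print(minimum_master_nodes(nodes))   # 4
    print(can_split_brain(nodes, 2))     # True: e.g. a 2/4 or 3/3 split
    print(can_split_brain(nodes, 4))     # False: only one side can reach 4
```

With the value set to 2, any partition that leaves at least 2 nodes on each side could let both sides elect a master, which is the data-loss risk mentioned above.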
The difference between primary and replica is not relevant to searches; moreover one replica will immediately be promoted to primary when the node holding the current primary leaves the cluster.
Apart from that, there's not much to go on here. Can you share more details?
I can share more details, just let me know what you require.
Also, I just spoke to Max Bashlawi, and we are now looking at getting someone in to consult on and review our infrastructure and config, as we feel it needs more careful tuning.
Sorry, I forgot to add a link to the relevant manual page, but yes, the one you linked is the right one.
I'm sure the consulting team will do a fine job, and it's probably best for me to leave further investigation to them. Their job will be a good deal easier if you can share with them all the relevant logs and any other evidence you can gather from the time of your outage.