Using gateway.recover_after_data_nodes to minimize recovery time in an Azure IAAS environment


The question is: if the number of data nodes in the cluster is less than gateway.recover_after_data_nodes, will the cluster go red and/or stop accepting writes? (I know I can test this, but it requires a full cluster restart, which takes a very long time.)

Context/reason for asking:

We run a cluster in Azure (IAAS). We have configured update domains in Azure and theoretically should not be impacted by their updates. However, we're finding that if the time between updates to VMs or VM hosts is too short for shard reallocation/rebalancing to finish, our cluster can go red, and recovery can take a long time because nodes are being restarted while we're still accepting writes.

We're looking for ways to mitigate this. Currently, we have gateway.recover_after_data_nodes set to n-1 (we have 11 data nodes, so it's set to 10). Our thinking is that if the cluster goes red (and/or stops accepting writes) whenever the number of data nodes drops below the gateway.recover_after_data_nodes value set in the yml, that may reduce the recovery time. Any thoughts or input on this would be appreciated.
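
For reference, a minimal sketch of the setting described above as it would appear in elasticsearch.yml, assuming the 11-data-node cluster mentioned in this post (the comment is illustrative; no other settings from our config are shown):

```yaml
# elasticsearch.yml (fragment)
# n-1 of our 11 data nodes, as described in the post
gateway.recover_after_data_nodes: 10
```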

We're also working with the Azure folks who focus on Elasticsearch to get notified of such updates and to explore additional options.

