We embed Elasticsearch into our web application and let Amazon EC2 Auto
Scaling start and terminate our server instances based on load metrics. The
exact same web app runs on every instance, so each new instance brings a new
Elasticsearch node into existence.
I am trying to bulletproof a few failure scenarios and want to ask what the
best practice is for handling them.
Because there is no way to know how many servers will be running at any
particular time, I am unsure how to detect split-brain scenarios and take
down the servers that are not part of the "primary" cluster. My current
thought is a watchdog service that performs these steps:
- Identify all instances in the EC2 group using the EC2 API.
- Perform a cluster health check against every node returned by the EC2 API.
  - In a healthy cluster, the number of nodes Elasticsearch reports will
    match the EC2 instance count.
- In a split-brain scenario (sketched in code after this list):
  - Identify each group of nodes using the Elasticsearch stats APIs.
  - Keep the largest group of nodes running.
  - Remove the other nodes from the load balancer using the ELB API.
  - Terminate the other nodes using the EC2 API.
- Auto Scaling will bring up new servers to replace the terminated ones,
  depending on load requirements.
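Here is a minimal sketch of that watchdog, assuming boto3 and requests are
available, that the instances carry a hypothetical `es-cluster` tag, that
every node exposes the HTTP API on port 9200, and that a hypothetical classic
ELB named `es-web-lb` fronts the instances. It groups instances by which
master each node reports via `/_cluster/state/master_node` and terminates
everything outside the largest group:

```python
import boto3
import requests
from collections import defaultdict

TAG_KEY, TAG_VALUE = "es-cluster", "production"  # hypothetical instance tag
ES_PORT = 9200

ec2 = boto3.client("ec2")


def cluster_instances():
    """Return (instance_id, private_ip) for every running instance in the group."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        (i["InstanceId"], i["PrivateIpAddress"])
        for r in resp["Reservations"]
        for i in r["Instances"]
    ]


def master_seen_by(ip):
    """Ask one node which master it has elected; None if unreachable."""
    try:
        r = requests.get(
            f"http://{ip}:{ES_PORT}/_cluster/state/master_node", timeout=2
        )
        return r.json().get("master_node")
    except requests.RequestException:
        return None


def enforce_single_cluster():
    instances = cluster_instances()

    # Group instances by the master they report. In a healthy cluster every
    # reachable node names the same master, so there is exactly one group.
    # Unreachable nodes fall into the None group and are treated as losers.
    groups = defaultdict(list)
    for instance_id, ip in instances:
        groups[master_seen_by(ip)].append(instance_id)

    if len(groups) <= 1:
        return  # healthy: the ES view agrees with the EC2 list

    # Split brain: keep the largest partition, terminate everything else.
    survivors = max(groups.values(), key=len)
    losers = [i for ids in groups.values() for i in ids if i not in survivors]

    # Deregister from the load balancer first, then terminate; Auto Scaling
    # will replace the terminated instances if load requires it.
    elb = boto3.client("elb")
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName="es-web-lb",  # hypothetical load balancer name
        Instances=[{"InstanceId": i} for i in losers],
    )
    ec2.terminate_instances(InstanceIds=losers)
```

Grouping by the reported master works because each partition in a split brain
elects its own master, so the number of distinct master IDs equals the number
of partitions.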
If this works, it does so only because there is a second API (EC2) I can
consult for a definitive list of all instances that should be in my cluster.
My worry is that this process will be too slow: data loss or corruption can
happen the moment one of the split-brain nodes serves a request.
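One way to shrink that window might be to push the same EC2-as-source-of-truth
check down to each node: before serving traffic, an instance compares the
cluster size its local Elasticsearch node sees against the EC2 instance count
and takes itself out of rotation if it is on the minority side. A minimal
sketch, under the same assumptions (hypothetical `es-cluster` tag, port 9200):

```python
import boto3
import requests

TAG_KEY, TAG_VALUE = "es-cluster", "production"  # hypothetical instance tag
ES_PORT = 9200


def should_serve_traffic():
    """Node-local guard: serve only if this node's cluster view contains a
    majority of the instances EC2 says should exist."""
    ec2 = boto3.client("ec2")
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    expected = sum(len(r["Instances"]) for r in resp["Reservations"])

    # Ask the local node how many nodes are in *its* view of the cluster.
    health = requests.get(
        f"http://localhost:{ES_PORT}/_cluster/health", timeout=2
    ).json()
    seen = health["number_of_nodes"]

    # On the minority side of a split brain, seen drops below a majority of
    # the EC2 count; refuse traffic until the views converge again.
    return seen > expected // 2
```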
Are there any other approaches I might take?