Running ES on the Amazon EC2 Container Service

Hi,

I plan to run an Elasticsearch cluster on the AWS ECS service. All ES nodes will run in Docker containers, and the ECS service will ensure that when an ES node fails, a new one is started in a new Docker container on the same EC2 instance to replace it.

In this setup it looks like I don't need additional master-eligible nodes anymore, as a failed master will be replaced by a new one within minutes. The metadata from the previous master will still be on the instance, and the new master will be able to use it straight away. Will this work as I expect?

I also plan to take regular snapshots of the current master, so that when the EC2 instance fails I will be able to restore it, and the new master running on it will have the metadata from the old one.

In general it seems there is no need for more than one master-eligible node when Elasticsearch is used with the EC2 Container Service. Please correct me if I'm wrong.

Thanks in advance!

I don't have any personal experience with the setup you described, but yes, that's certainly doable. You will have longer downtimes, though, than if you had other master-eligible nodes on standby (failover then takes only a few seconds). Depending on how often you snapshot the current master, you can also run into a situation where you no longer have the latest cluster state, effectively undoing some of the changes in the cluster.

It seems like you're overcomplicating things. Why are you taking this route?

Can you tell me whether there is a way to force a master-eligible data node, which became both master and data node after the first master failed, to function only as a data node again? I don't want to have nodes performing both master and data functions.

Also, if the cluster is left without a master and there are no more master-eligible nodes, what will the cluster do? Will it just wait for a new master to come in, and will it be able to continue working after the new master joins?

Will the new master be able to join and work with a cluster that has no master-eligible nodes and has lost its original master, if this new master doesn't have any metadata about the state of the cluster?

Well, I want to use the ECS service mechanism, which will make sure that failed Docker containers are always replaced with new ones. At the same time I don't want to end up with nodes that run as both master and data. To me this sounds like a simple route?

Why would an ES node fail? How will blindly replacing it help fix the problem?

You absolutely do need master-eligible nodes; your cluster will not function without them!

Sure, if you and your users like waiting for your monitoring and deployment process to catch the failure, spin up a new node, have it join the cluster, and then for things to recover.

So just run some small master-only nodes that are separate.

No, that is not dynamically configurable. Master eligibility / data eligibility needs to be configured at node startup.
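
For example, a minimal elasticsearch.yml sketch, assuming the classic node.master / node.data settings (the exact syntax depends on your ES version):

```yaml
# elasticsearch.yml -- read once at node startup; these settings
# cannot be changed on a running node
node.master: true   # node may be elected master
node.data: false    # node does not hold shard data
```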

The cluster will be read-only, i.e. searches will still work (as long as no other failures happen).
When the new master joins, it will resume accepting writes.
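
The exact behaviour while there is no elected master is controlled by a setting along these lines (a minimal elasticsearch.yml sketch, assuming the zen-discovery era settings):

```yaml
# elasticsearch.yml -- what to block while no master is elected:
# "write" (the default) rejects writes but still serves reads,
# "all" rejects both reads and writes
discovery.zen.no_master_block: write
```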

No.

Thank you for your answers!

This is what the AWS ECS service is supposed to do. It will spawn a new Docker container, which will in turn use the data left behind by the container that failed. This will happen in minutes, and the new container will look exactly like the old one, or at least that is what is supposed to happen.
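
A minimal sketch of the idea in Docker Compose syntax (the ECS task definition would declare the same host-path volume; the paths and image tag below are just placeholders):

```yaml
# docker-compose.yml (illustrative) -- keep the ES data directory on the
# EC2 instance so a replacement container on the same host reuses it
services:
  elasticsearch:
    image: elasticsearch:2.4          # placeholder tag
    volumes:
      - /mnt/es-data:/usr/share/elasticsearch/data
```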

What is the procedure when this happens? Once a data node becomes both master and data, how do I split the roles again? I don't want nodes with double roles. Is the only way to prevent this to have a spare master-eligible node, which is also configured with node.data: false?

It's a situation which is not really supported (changing data eligibility / master eligibility of a node), so there is no clear procedure for it.

The AWS ECS service helps you bring back a node faster, but that does not automatically give you high availability for an ES cluster (which seems to be the goal here).

Having the cluster read-only for a while, until the failed container is replaced by ECS, will be okay, since I plan to use a broker in front of ES.

On the other hand, ending up with nodes with double roles would mess up my cluster to a point from which I cannot easily recover (I mean there is no going back to the original cluster setup).

Having a spare master-only node that is not also a data node would increase the cost. Therefore I assume that having one master-eligible node (which will also be the master), several data nodes, plus one client node, all running in Docker containers on AWS ECS, will do.
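
Roughly, the per-role settings I have in mind would look like this (a sketch assuming the classic node.master / node.data style settings; these are fragments, one per node role):

```yaml
# the single master-eligible node (will also be the acting master)
node.master: true
node.data: false

# data nodes
node.master: false
node.data: true

# client (coordinating-only) node
node.master: false
node.data: false

# with only one master-eligible node this must stay at 1
discovery.zen.minimum_master_nodes: 1
```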

In this case failed containers will be replaced by the ECS service, and a failed master instance I would be able to restore from the daily EBS snapshots that I plan to take.

In the worst-case scenario the cluster will be read-only until the new master is restored. No data will be lost, because once the master fails the cluster will be read-only, and the broker will aggregate and keep the logs until the cluster has its master back.

On the positive side, though, I will be sure that the cluster roles won't change (and can therefore plan for dedicated EC2 instance types/sizes for each role), and I won't have to pay for a spare master instance.