Treating your cluster like cattle instead of pets

Anyone who has managed large server farms soon realizes that treating your servers like cattle instead of pets is the most efficient way to deploy, run, and decommission nodes quickly. If a server fails you, you take it out back, humanely shoot it, and get another one.

What I'd like to know is whether anyone has created documentation on how to efficiently build a larger Elasticsearch cluster, and any best practices involved. For instance, for the configuration, does it make sense to use ${HOSTNAME} for the node name instead of editing a bunch of configuration files for a lot of nodes? I realize you will probably need to modify generic configs for data nodes, ingest nodes, etc.

What I'm looking for ultimately is documentation that discusses current best practices on how to launch a larger cluster in the most efficient manner possible. Let's say you purchase a dozen servers in a COLO and you want to create a cluster with them. Many will be data nodes, some will be search nodes, etc. Let's just pick an OS to get started -- Ubuntu Server 22.04. What sysadmin tools would you use to quickly provision a new cluster without having to worry about special treatment for each node?

Thank you!

Most people seem to be using k8s with ECK for this style approach.
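If you go that route, the whole cluster topology lives in one manifest and the operator handles the per-node drudgery. A minimal sketch (the cluster name, version, and node counts below are placeholders, not recommendations):

```yaml
# Hypothetical ECK manifest -- name, version, and counts are illustrative only.
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: colo-cluster           # placeholder cluster name
spec:
  version: 8.13.0              # pick whichever version you actually run
  nodeSets:
    - name: masters
      count: 3
      config:
        node.roles: ["master"]
    - name: data
      count: 9
      config:
        node.roles: ["data", "ingest"]
```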


For instance, for the configuration, does it make sense to use ${HOSTNAME} for the node name…

It makes sense if you're happy to find yourself looking at a list of node names and not knowing what each node does. :wink: On the setup I work with, we have node names made up of the hostname, the function, and where applicable the data centre. E.g. cold_dc1_foo for a server called foo which is a cold data node in data centre one. (Data nodes are split between data centres for resiliency.) Or master_bar for a server called bar which is a dedicated master node that can exist in either data centre. In Kibana monitoring I can filter the node list down to just cold data nodes by putting cold_ in the filter field.
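One way to get informative names without hand-editing every file is to let your config management fill in the role/site prefix and leave the hostname to ${HOSTNAME} substitution, which elasticsearch.yml does support. A sketch for a cold data node in data centre one (I haven't verified mixing a literal prefix with ${HOSTNAME} on every version, so treat it as illustrative):

```yaml
# elasticsearch.yml fragment -- prefix and attribute values are illustrative
node.name: cold_dc1_${HOSTNAME}     # hostname filled in by environment variable substitution
node.roles: [data_cold]
node.attr.dc: dc1                   # custom attribute for allocation awareness across data centres
cluster.routing.allocation.awareness.attributes: dc
```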

…instead of editing a bunch of configuration files for a lot of nodes.

Even if you don't bother specifying node names, you're going to have to customise the Elasticsearch configuration on every node anyway to specify things like the node roles.

But if you're editing configuration files by hand, you're arguably doing it the hard way. Configuration management is the way to go. Don't build a cluster by logging in to servers and editing files; build configuration management which will put a server into the desired state. If you "shoot" a node because it has failed you, you can tell the configuration management to make a new server a cold node. And keep all your configuration management in version control so that you can roll back things which turn out to be a bad idea.
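One concrete way to express that desired state is to keep node roles and name prefixes in the inventory rather than in hand-edited files, and have a template turn them into elasticsearch.yml. A minimal Ansible-style sketch (the group names and es_* variables are made up for illustration):

```yaml
# inventory.yml -- hypothetical layout; es_node_roles and es_name_prefix are
# variables invented here, consumed by an elasticsearch.yml template.
all:
  children:
    masters:
      hosts:
        bar:
      vars:
        es_node_roles: ["master"]
        es_name_prefix: master
    cold_data_dc1:
      hosts:
        foo:
      vars:
        es_node_roles: ["data_cold"]
        es_name_prefix: cold_dc1
```

Replacing a failed cold node then means adding the new hostname to the right group and re-running the play; nothing on the box itself is hand-crafted.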

Use CloudFormation or Terraform to manage instances (masters) and auto scaling groups that launch instances (data, client, etc. -- make sure to suppress terminate signals on data nodes to prevent data loss).
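If you use CloudFormation, the data-node group might look roughly like this; scale-in protection is one way to stop the ASG from terminating instances that still hold shards (resource names, sizes, subnets, and the AMI are placeholders):

```yaml
# Hypothetical CloudFormation fragment for the data-node auto scaling group.
Resources:
  DataNodeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: ami-0123456789abcdef0   # placeholder; in practice a Packer-built AMI
        InstanceType: i3en.2xlarge       # placeholder instance type
  DataNodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "9"
      MaxSize: "9"
      NewInstancesProtectedFromScaleIn: true   # drain and remove protection before terminating
      VPCZoneIdentifier:
        - subnet-aaaa1111                      # placeholder subnet IDs
        - subnet-bbbb2222
      LaunchTemplate:
        LaunchTemplateId: !Ref DataNodeLaunchTemplate
        Version: !GetAtt DataNodeLaunchTemplate.LatestVersionNumber
```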

Use Ansible or Chef or whatever configuration management software you want to install and configure Elasticsearch and supporting packages, as well as to lock down the OS per your corporate requirements.
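For the Ansible side, the skeleton is just the Elastic APT repo plus a templated config; everything below (play layout, template name, handler) is a sketch rather than an exact playbook:

```yaml
# playbook.yml -- minimal sketch; the elasticsearch.yml.j2 template is assumed to
# exist and to consume whatever role/name variables your inventory defines.
- name: Install and configure Elasticsearch
  hosts: all
  become: true
  tasks:
    - name: Fetch the Elastic signing key
      ansible.builtin.get_url:
        url: https://artifacts.elastic.co/GPG-KEY-elasticsearch
        dest: /usr/share/keyrings/elasticsearch.asc

    - name: Add the Elastic APT repository
      ansible.builtin.apt_repository:
        repo: "deb [signed-by=/usr/share/keyrings/elasticsearch.asc] https://artifacts.elastic.co/packages/8.x/apt stable main"
        state: present

    - name: Install Elasticsearch
      ansible.builtin.apt:
        name: elasticsearch
        update_cache: true

    - name: Render elasticsearch.yml
      ansible.builtin.template:
        src: elasticsearch.yml.j2
        dest: /etc/elasticsearch/elasticsearch.yml
      notify: restart elasticsearch

  handlers:
    - name: restart elasticsearch
      ansible.builtin.service:
        name: elasticsearch
        state: restarted
        enabled: true
```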

Personally I use Packer to build pre-configured AMIs. When my Ansible changes the ES config somehow (version, config settings, etc.), I first rerun Packer to build new AMIs. My Terraform is configured to look up the AMIs based on a specific naming convention, so I just apply that Terraform to update the ASGs. I use terraform apply --target to rotate (terminate and replace) masters one at a time.

Yes, I could just run the Ansible directly against the existing masters, but this way I know that someone else on the team can follow a playbook to terminate and replace a sick master.

And I take SPECIAL care doing anything with data nodes. It takes about 3 days to evacuate indexes off a single data node, so I have to plan further ahead than usual when I need to rotate data nodes (something I reserve for exceptional cases).
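For reference, the evacuation itself is just allocation filtering (cluster.routing.allocation.exclude._name); here's a sketch of how it can be driven from Ansible -- the endpoint, node name, and credentials are placeholders:

```yaml
# Hypothetical drain task: relocate all shards off one data node before rotating it.
- name: Exclude the outgoing data node from shard allocation
  ansible.builtin.uri:
    url: "https://es.example.internal:9200/_cluster/settings"   # placeholder endpoint
    method: PUT
    body_format: json
    body:
      persistent:
        cluster.routing.allocation.exclude._name: "data-node-to-drain"   # placeholder node name
    user: elastic
    password: "{{ es_password }}"        # placeholder credential lookup
    force_basic_auth: true
```

Once the node holds no shards you can terminate it, then clear the exclusion.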
