Doubts on Huge Elastic Clustering and HA

Hello mates.

Today I got involved in a project in which we are supposed to provide high availability for an Elasticsearch cluster. This is going to be a huge project, meaning that resources are not a problem (we are talking on the scale of terabytes of RAM). Taking this into account, what do you think would be the proper way to achieve HA? Do you know if there is any documentation available on this topic?

The HA requirement goes in two directions:

  • Elasticsearch must be accessible at all times.
  • We must preserve backups of the data.

My main doubts and assumptions, based on what I have read, are:

  • Fewer, bigger nodes (max. 64 GB RAM each) are better than many smaller ones.
  • Should one big machine host multiple nodes, or should it be one machine per node? (These would be VMs on Proxmox.)
  • For the backups, do you think it is better to back up the whole machine (VM snapshots) or the Elasticsearch data itself?
  • For the 24/7 access, I guess I would need some kind of load balancer to achieve HA in case one cluster fails. Is this right?

If you have any kind of advice before we start this project, feel free to share it; I would appreciate it a lot.

Thank you in advance.

I run a very, very large Elasticsearch cluster. Here are a few of my findings from running it at five-nines availability for the data.

The more nodes you run, the larger the cluster state that the master nodes have to handle. The more nodes you run, the more intra-cluster communication there is, which means a higher chance of timeouts.

Separate out the roles of each node: master, coordinating, data, ingest... This keeps the load distributed, and failures caused by heavy aggregations running on a data node become less likely.
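
If it's useful, here is a quick way to sanity-check the role layout from the API. A minimal sketch, assuming the official Python client (elasticsearch-py, 7.x-style API) and a placeholder host; the same information is available from `GET _cat/nodes`:

```python
# Sketch: list the role letters each node carries, so you can verify that
# dedicated masters and coordinating-only nodes really are dedicated.
# Host URL below is a placeholder for your environment.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# node.role is a string of letters, e.g. "m" (master), "d" (data), "i" (ingest);
# a coordinating-only node shows up as "-".
for node in es.cat.nodes(format="json", h="name,node.role,heap.percent"):
    print(node["name"], node["node.role"], node["heap.percent"])
```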

The more shards you have in your cluster, the higher the chance of timeouts when making settings changes across multiple indices (e.g. when performing cluster maintenance).
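
One small mitigation is to be explicit about timeouts on those bulk settings changes. A sketch, assuming the same Python client as above; the index pattern, setting, and timeout values are illustrative only:

```python
# Sketch: apply a settings change across a wide index pattern with explicit
# timeouts, so the request doesn't give up while a large cluster state update
# is being published.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.indices.put_settings(
    index="logs-*",
    body={"index": {"number_of_replicas": 1}},
    timeout="2m",          # wait this long for the change to be acknowledged
    master_timeout="2m",   # wait this long for the master node to respond
)
```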

Run multiple nodes per machine (up to a point) rather than exceeding the 32-bit compressed object pointer memory limit (I found this to be a 30.5 GB JVM heap in my case), while keeping the other half or more of the machine's memory for the filesystem cache.
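
The heap itself is set via -Xms/-Xmx in jvm.options; you can check from the API whether each node actually stayed under the compressed-oops threshold. A sketch, assuming the same Python client; field names follow the nodes info API, so double-check them on your version:

```python
# Sketch: print each node's configured max heap and whether compressed object
# pointers are in use.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

for node in es.nodes.info(metric="jvm")["nodes"].values():
    heap_gib = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    oops = node["jvm"].get("using_compressed_ordinary_object_pointers")
    print(f'{node["name"]}: heap {heap_gib:.1f} GiB, compressed oops: {oops}')
```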

Use index templates and define the mappings ahead of time for each index.
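
For illustration, a minimal sketch of a legacy index template that pins settings and mappings for any index matching a pattern before the first document arrives. The pattern, field names, and shard counts are placeholders; on 7.8+ you would typically use composable templates (put_index_template) instead:

```python
# Sketch: define a template so new indices get predictable mappings/settings.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.indices.put_template(
    name="logs-template",
    body={
        "index_patterns": ["logs-*"],
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},
                "host": {"type": "keyword"},
            }
        },
    },
)
```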

During off-peak hours, merge segments down to fewer segments. This will reduce memory consumption per shard.
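
That is the force merge API; a sketch with the same client, where the index pattern is illustrative and should only cover indices that are no longer being written to:

```python
# Sketch: force-merge older, read-only indices down to a single segment during
# a quiet window. Force merge is expensive, so schedule it off-peak.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.indices.forcemerge(index="logs-2020.01.*", max_num_segments=1)
```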

Run multiple Kibana instances behind a load balancer with health checking, pointing at multiple load-balanced coordinating-only nodes.
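
The exact health check depends on your load balancer; the sketch below just shows the kind of lightweight probe it could run against each coordinating-only node. Hostnames are placeholders and `requests` is used for brevity:

```python
# Sketch: probe each coordinating-only node; a node answering "/" with HTTP 200
# stays in rotation, anything else should be pulled out.
import requests

COORDINATING_NODES = ["http://es-coord-1:9200", "http://es-coord-2:9200"]

for url in COORDINATING_NODES:
    try:
        resp = requests.get(url, timeout=2)
        print(url, "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}")
    except requests.RequestException as exc:
        print(url, "DOWN:", exc)
```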

Have adequate monitoring on every aspect of your cluster. Perform small-scale testing on ingest, data, JVM heap, settings, and queries; make one change at a time and compare the differences.
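
A sketch of a few basics worth polling and graphing, using standard cluster APIs via the same Python client; thresholds and alerting belong in whatever monitoring stack you choose:

```python
# Sketch: poll cluster health and per-node JVM heap usage.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

health = es.cluster.health()
print("status:", health["status"],
      "relocating:", health["relocating_shards"],
      "unassigned:", health["unassigned_shards"])

for node in es.nodes.stats(metric="jvm")["nodes"].values():
    print(node["name"], "heap used %:", node["jvm"]["mem"]["heap_used_percent"])
```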

For backups, we use snapshots and snapshot each index individually due to retention requirements. We store the snapshots in HDFS (although there are other options).
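
For illustration, a minimal sketch of per-index snapshots with the same client. A shared-filesystem ("fs") repository is used here to keep the example self-contained; for HDFS you would register a repository of type "hdfs" via the repository-hdfs plugin. All names and paths are placeholders:

```python
# Sketch: register a snapshot repository and snapshot one index at a time, so
# each index can follow its own retention schedule.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.snapshot.create_repository(
    repository="my_backups",
    body={"type": "fs", "settings": {"location": "/mnt/es-backups"}},
)

es.snapshot.create(
    repository="my_backups",
    snapshot="logs-2020.01.15-daily",
    body={"indices": "logs-2020.01.15", "include_global_state": False},
    wait_for_completion=True,
)
```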

Utilize a configuration management solution for installing and configuring Elasticsearch (we use Chef).

Be prepared for some hiccups along the way until your tuning is complete (it never is).

Try to stay as up to date as possible with releases; when you find a change that will benefit your installation/use case, staying current will make the upgrade much easier.


That's going to cost you a lot of money. Not only do you have to worry about Elasticsearch, you also need to look at hardware (compute, network, power, data centre), people, software (OS, network), and more.

@Adrian_Varona we recently added to the reference manual some general guidance on achieving high availability in an Elasticsearch cluster: https://www.elastic.co/guide/en/elasticsearch/reference/current/high-availability-cluster-design.html

This doesn't address your specific questions (the answer to most of which is, unfortunately, "it depends"), but I hope it's of help.
