I understand that 2-3 zones are recommended for a production cluster.
The main documented reason is that if an entire site has an issue, there is a redundant failover site.
I also know that to define 3 dedicated masters I must use the "master per zone" checkbox.
But let's say I don't expect an entire zone to go down, and I can set up each node as master-eligible. Are there any other factors I have not thought of when choosing between 1 and 3 zones?
My main issue started with having 5 servers for ECE. With 3 zones I get: Zone1 - 2 servers, Zone2 - 2 servers, and Zone3 - 1 server (unbalanced).
Scaling up is always in multiples of 3, and I might run into issues in Zone3.
If I put all 5 servers in a single zone, I can scale up more easily and use resources better.
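The imbalance described above is simple arithmetic. A minimal sketch, assuming servers are spread as evenly as possible across zones (spread_servers is a hypothetical helper, not part of ECE):

```python
from typing import List


def spread_servers(num_servers: int, num_zones: int) -> List[int]:
    """Spread servers over zones as evenly as possible.

    Hypothetical helper for illustration only; ECE does not expose this.
    """
    base, extra = divmod(num_servers, num_zones)
    # The first `extra` zones receive one additional server each.
    return [base + 1 if i < extra else base for i in range(num_zones)]


three_zone_layout = spread_servers(5, 3)   # Zone1, Zone2, Zone3
single_zone_layout = spread_servers(5, 1)  # everything in one zone
print(three_zone_layout, single_zone_layout)
```

With 5 servers and 3 zones, one zone always ends up with a single server, which is the unbalanced case in question.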
One important point to make clear is that ECE is meant to be installed in a single DC. You can think about it the way you would about a region in GCP, AWS, Azure, etc., where you provision VMs in different availability zones (AZs), e.g. us-east1, us-east2, etc. This is unlike cross-site fault tolerance, which would mean installing the setup across multiple regions.
As for the importance of having 3 AZs, it is currently mandatory in order to provide a highly available setup, for both 2-AZ and 3-AZ deployments. The reason ECE needs all 3 zones, even if you want to use only 2 AZs for your data nodes, is to spin up a no-data node in the third zone that is used only to avoid a split-brain scenario. We are working on improving this and natively supporting 2 AZs, removing the need for a "fake" third zone.
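The need for a third zone follows from the majority rule used for master election: a master can only be elected when more than half of the master-eligible nodes are reachable. A minimal sketch of that rule, assuming one master-eligible node per zone (has_quorum and the zone names are illustrative, not ECE APIs):

```python
from typing import Dict


def has_quorum(masters_per_zone: Dict[str, int], failed_zone: str) -> bool:
    """Return True if a majority of master-eligible nodes survives the zone failure."""
    total = sum(masters_per_zone.values())
    surviving = total - masters_per_zone[failed_zone]
    # A strict majority is floor(total / 2) + 1.
    return surviving >= total // 2 + 1


# Two data zones only: losing either zone leaves 1 of 2 masters, no majority.
two_zones = {"zone1": 1, "zone2": 1}

# Two data zones plus a no-data tiebreaker in a third zone:
# losing any single zone still leaves 2 of 3 masters.
with_tiebreaker = {"zone1": 1, "zone2": 1, "zone3": 1}

for zone in with_tiebreaker:
    print(zone, has_quorum(with_tiebreaker, zone))
```

With only two zones, neither half can tell a failed peer from a network partition, which is the split-brain risk the tiebreaker node removes.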
With regard to keeping a balanced environment, ECE currently tries to optimise usage of the available resources: when deploying new nodes it will attempt to fill existing allocators, if possible, before deploying to a new allocator. So if you have a mix of clusters with different sizes and different AZ configurations, it should be less of an issue. You will also soon be able to tag allocators and have more control over which allocator a node is deployed to, which should also allow you to make sure deployments are spread evenly.
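To make the "fill allocators first" idea concrete, here is a rough sketch of one possible fill-first policy. This is an assumption for illustration only and not ECE's actual placement algorithm; pick_allocator, the allocator names, and the memory figures are all hypothetical.

```python
from typing import Dict, Optional, Set


def pick_allocator(free_gb: Dict[str, int], used: Set[str], node_gb: int) -> Optional[str]:
    """Pick an allocator for a new node, preferring ones already in use.

    Illustrative fill-first policy only, not ECE's real logic.
    """
    candidates = [name for name, free in free_gb.items() if free >= node_gb]
    if not candidates:
        return None
    # Allocators that already host nodes sort first; ties are broken by
    # least free memory (tightest fit), then by name for determinism.
    candidates.sort(key=lambda name: (name not in used, free_gb[name], name))
    return candidates[0]


free = {"allocator1": 8, "allocator2": 32}
in_use = {"allocator1"}
print(pick_allocator(free, in_use, 4))   # small node fits on the used allocator
print(pick_allocator(free, in_use, 16))  # larger node needs the empty allocator
```

Under a policy like this, a mix of small and large clusters tends to fill the partially used hosts before touching empty ones.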
Having said that, your feedback is very much relevant and we will see what can be done in the future to make sure this aligns with our best practices.
Hope this helps.
Thank you Roy,
If you could add one clarification for my follow-up question, that would be appreciated:
Let's say I decide to have only one zone, with all nodes master-eligible.
What "fault tolerance" logic will I lose here?
If a node fails, the cluster will use the replicas that reside on other nodes.
If a host fails (and I set cluster.routing.allocation.same_shard.host = true),
then all replicas are also on other hosts and nodes.
Both cases allow the cluster to recover.
So what is the "extra" logic when using 3 zones?
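For reference, the host-level setting mentioned above could be expressed as a cluster settings request body along these lines. This is a sketch: whether the setting can be changed dynamically through the cluster settings API or must go into elasticsearch.yml depends on the Elasticsearch version, so treat the request form as an assumption.

```python
import json

# Body for a cluster settings update. The setting name comes from the
# question above; using "persistent" scope here is an assumption.
settings_body = {
    "persistent": {
        "cluster.routing.allocation.same_shard.host": True
    }
}

print("PUT _cluster/settings")
print(json.dumps(settings_body, indent=2))
```

This keeps copies of the same shard off a single host, but it says nothing about hosts that share a rack or switch.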
Sure, I'll try.
The idea is that an Availability Zone is normally used as a means of stronger separation than simply using a different virtual or physical machine (as you describe in the second scenario) or, in the case of Elasticsearch, having multiple nodes with replicas configured (as you describe in the first scenario). It can represent different racks (in a big DC, different rooms or buildings), connected to different switches, etc.
It doesn't protect against an entire site, DC, or region going down. For that you will need cross-site replication, which can be achieved by other means (using Logstash to fork and index to two different sites, using Kafka to replicate across sites, etc.). This is also something we are working to support natively in the Elastic Stack.
We recommend that you install allocators on different machines and map those to the relevant zones, so that when deploying a cluster with more than 1 AZ, ECE will make sure to deploy the nodes to allocators in different AZs. This in turn provides tolerance in case an entire rack goes down, which could otherwise take out all the nodes in your cluster if zones are not configured correctly.
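The difference between one zone and three can be sketched with a small simulation. It assumes each zone maps to one rack and that a rack failure takes down every allocator in that zone; the host names, zone names, and helper functions are hypothetical, not ECE APIs.

```python
from typing import Dict, List


def place_across_zones(allocator_zone: Dict[str, str], zones_wanted: int) -> List[str]:
    """Pick one allocator from each of `zones_wanted` distinct zones."""
    chosen: Dict[str, str] = {}  # zone -> allocator picked for that zone
    for allocator, zone in sorted(allocator_zone.items()):
        if zone not in chosen:
            chosen[zone] = allocator
        if len(chosen) == zones_wanted:
            break
    if len(chosen) < zones_wanted:
        raise ValueError("not enough distinct zones")
    return sorted(chosen.values())


def survivors(placement: List[str], allocator_zone: Dict[str, str], failed_zone: str) -> List[str]:
    """Allocators from the placement that are still up after a zone (rack) fails."""
    return [a for a in placement if allocator_zone[a] != failed_zone]


# Five hosts mapped to three zones, as in the original question.
three_zones = {"host1": "zone1", "host2": "zone1", "host3": "zone2",
               "host4": "zone2", "host5": "zone3"}
placement = place_across_zones(three_zones, 3)
print(survivors(placement, three_zones, "zone1"))

# The same five hosts all declared as a single zone (one rack).
one_zone = {host: "zone1" for host in three_zones}
print(survivors(sorted(one_zone), one_zone, "zone1"))
```

In the three-zone case two nodes survive the loss of a rack. In the single-zone case nothing survives, even though the replicas were on different hosts.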
Hope this helps.