I have started in a new position where we currently have a cluster spread across two data centers, with 5 master/data nodes + 1 master-only node, and indices having a replication of 1.
I'm wondering whether we would be better off with a few (+3) dedicated master-only nodes plus data-only nodes, how best to spread these across the two DCs, how many master and data nodes to run in each DC, and whether we should use replication > 1.
Would we need to prefer one DC over the other, or would it be better to use 3 DCs so that we can lose one DC and still have quorum?
As is, I can't see any advantage in using two DCs; on the contrary, it complicates things and adds latency.
If you are looking for high availability in case of a full data center failure, you need 3 zones. You can put a master node in each DC and a master-only node (possibly voting-only) in a third DC. If you do not have this, you will only be able to handle the failure of the data center that holds the minority of master-eligible nodes.
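To make the quorum argument concrete, here is a minimal sketch that treats master election as a plain majority vote over master-eligible nodes. The zone names and node counts are made up for illustration, and the model ignores details such as Elasticsearch adjusting its voting configuration at runtime, so read it as arithmetic rather than as a description of the cluster's exact behaviour.

```python
def quorum(total_masters: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_masters // 2 + 1


def survives_loss(masters_per_zone: dict, lost_zone: str) -> bool:
    """True if the remaining master-eligible nodes can still form a majority."""
    total = sum(masters_per_zone.values())
    remaining = total - masters_per_zone[lost_zone]
    return remaining >= quorum(total)


# Hypothetical layouts (names and counts are assumptions, not from the thread).
two_zones = {"dc1": 2, "dc2": 1}
three_zones = {"dc1": 1, "dc2": 1, "dc3": 1}

for layout in (two_zones, three_zones):
    for zone in layout:
        print(layout, "lose", zone, "->", survives_loss(layout, zone))
```

With two zones, one of them always holds at least half of the votes, so losing that zone leaves the cluster without a master. With one master-eligible node in each of three zones, any single zone can be lost.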
Yes, if the cluster should be able to handle losing one DC, we'll need more data nodes and will have to configure shard allocation awareness, which is currently not the case. I don't know why someone initially spread the cluster over two DCs or what objective they were aiming at. Three zones would be preferable if aiming for high availability across zones.
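For reference, shard allocation awareness is driven by a custom node attribute plus a cluster setting. The sketch below only builds the request body one might send to `PUT _cluster/settings`; the attribute name `zone` and the values `dc1`, `dc2`, `dc3` are assumptions, and each node would also need a matching `node.attr.zone` entry in its `elasticsearch.yml`.

```python
import json

# Assumed attribute name and zone values; adjust to the real DC names.
AWARENESS_ATTRIBUTE = "zone"
ZONES = ["dc1", "dc2", "dc3"]

cluster_settings = {
    "persistent": {
        # Spread copies of each shard across distinct values of the attribute.
        "cluster.routing.allocation.awareness.attributes": AWARENESS_ATTRIBUTE,
        # Forced awareness: do not pile all copies into the surviving zones
        # when one zone is down.
        f"cluster.routing.allocation.awareness.force.{AWARENESS_ATTRIBUTE}.values": ",".join(ZONES),
    }
}

# Body for PUT _cluster/settings
print(json.dumps(cluster_settings, indent=2))
```

Whether forced awareness is wanted depends on how much spare capacity the remaining zones have, so treat that second setting as optional.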