A very tricky question here. Our current setup comprises 6 nodes, 3 in one DC and 3 in the other, so they work in a kind of high-availability scenario where we always have a replica in the other DC.
So far it's worked great, but it is true we have experienced some network saturation between the 2 DCs, so we are looking to reduce "indices.recovery.max_bytes_per_sec" to 20mb rather than the default 40mb. So here is my question: is there a smarter way to apply rate limiting in this scenario? Mainly because we only want to limit the replication directed to the other DC, not the replication between nodes in the same DC.
At the moment this is how we configure each node:
cluster.name: elks-cluster
node.attr.data_center: DC1   # DC2 on the other 3 nodes
cluster.routing.allocation.awareness.attributes: data_center
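And this is roughly how we were planning to apply the reduced recovery limit, by adding it to the same elasticsearch.yml on every node (it is a dynamic setting too, so it could also be changed at runtime through the cluster settings API; the 20mb value is just what we are considering):
indices.recovery.max_bytes_per_sec: 20mb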
That is not a highly available deployment. Although you can handle a single node failing, you have a problem if you lose a full DC, as your cluster will not be operational because you no longer have a majority of master-eligible nodes available (a majority of 6 is 4). To achieve HA you need to add a dedicated master node in a third DC, which can be small and even voting-only.
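As a rough sketch, such a tiebreaker node in a third DC could look something like this in elasticsearch.yml (assuming a version recent enough to support node.roles and voting_only; the node name and attribute value here are placeholders):
node.name: tiebreaker
node.roles: [ master, voting_only ]
node.attr.data_center: DC3
A voting-only node participates in master elections but can never be elected master itself, so it can run on much smaller hardware than the data nodes.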
For the cluster to be highly available you need to have enough bandwidth to allow all writes and updates to replicate in real time, and you need to make sure at least one copy (primary or replica) of each shard is available in each DC, which means there will always be traffic going across the inter-DC link.
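If you want to guarantee that the copies of a shard always stay split across the two DCs, you can also force the awareness attribute, so Elasticsearch will leave a replica unassigned rather than put both copies into the surviving DC (a sketch building on the attribute you already configured, using your DC values):
cluster.routing.allocation.awareness.attributes: data_center
cluster.routing.allocation.awareness.force.data_center.values: DC1,DC2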
If you have limitations with respect to bandwidth it may be better to set up 2 separate clusters and index into both in parallel, e.g. from a message queue.
It makes sense. We did some testing simulating an outage in one of the DCs and it did work, but it is true that it didn't work well when the outage occurred in the secondary DC. In theory we have enough bandwidth for the replicas to work on a daily basis, and they do work in real time, meaning we create a primary shard in DC1 and have a replica in DC2 immediately; it also works the other way around. However, we've noticed a lot of bandwidth usage when one of the DCs has been unavailable and suddenly reconnects; that's when our DC-to-DC line gets saturated.
I assume this means that you have more master-eligible nodes in one of the DCs? If that is the case you can handle a single node failing, as well as the right DC failing (the one holding the minority of master-eligible nodes).
The default recovery speed is set quite low in order to not overwhelm nodes in the cluster, so if your bandwidth can not handle this when you have a large outage I think it could be a problem. You can naturally lower the limit, but be aware that this may increase recovery time and leave you exposed for longer.
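If you do lower it, it can also help to keep an eye on what is actually recovering while the DCs resync, for example with the cat recovery API (this just lists ongoing recoveries and their progress):
GET _cat/recovery?v&active_only=true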
Exactly! Sadly, I'm aware of the problem; that's why I'm asking if there's a way to limit this recovery only for the nodes in the other DC, without impacting the nodes in the same DC that don't have a bandwidth problem.
No, I do not think there is. It is generally not recommended to deploy Elasticsearch across DCs where there are latency or throughput constraints beyond what would exist in a single DC.
Thanks!! I was hoping for a better answer, but you are right on all the points you made. I will speak with our IT department, because I think they plan on upgrading the connection (but I'm not sure when), and in the meantime we can deploy a script that lowers the recovery speed only when a certain amount of bandwidth is being consumed.
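Something along these lines is probably what the script would end up calling whenever the line gets busy (just a sketch; the bandwidth check itself would have to live outside Elasticsearch, and 20mb is the value we were considering):
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}
Setting the value back to null would restore the default once the link is quiet again.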