A very tricky question here. Our current setup comprises 6 nodes, 3 in one DC and 3 in the other, so they work in a kind of high-availability scenario where we always have a replica in the other DC.
So far it's worked great, but it is true we have experienced some network saturation between the 2 DCs, so we are looking to reduce "indices.recovery.max_bytes_per_sec" to 20mb rather than the default 40mb. So here is my question: is there a smarter way to apply rate limiting in this scenario? Mainly because we only want to limit the replication directed to the other DC, not the replication between nodes in the same DC.
At the moment this is how we configure each node:
cluster.name: elks-cluster
node.attr.data_center: DC1   # DC2 on the other 3 nodes
cluster.routing.allocation.awareness.attributes: data_center
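And this is roughly how we were planning to apply the reduced recovery limit, by adding it to the same elasticsearch.yml on every node (it is a dynamic setting too, so it could also be changed at runtime through the cluster settings API; the 20mb value is just what we are considering):
indices.recovery.max_bytes_per_sec: 20mb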
That is not a highly available deployment. Although you can handle a single node failing, you have a problem if you lose a full DC, as your cluster will not be operational because you no longer have a majority of master-eligible nodes available (a majority of 6 is 4). To achieve HA you need to add a dedicated master node in a third DC, which can be small and even voting-only.
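As a rough sketch, such a tiebreaker node in a third DC could look something like this in elasticsearch.yml (assuming a version recent enough to support node.roles and voting_only; the node name and attribute value here are placeholders):
node.name: tiebreaker
node.roles: [ master, voting_only ]
node.attr.data_center: DC3
A voting-only node participates in master elections but can never be elected master itself, so it can run on much smaller hardware than the data nodes.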
For the cluster to be highly available you need to have enough bandwidth to allow all writes and updates to replicate in real time, and you need to make sure at least one copy (primary or replica) of each shard is available in each DC, which means there will always be traffic going across the inter-DC link.
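If you want to guarantee that the copies of a shard always stay split across the two DCs, you can also force the awareness attribute, so Elasticsearch will leave a replica unassigned rather than put both copies into the surviving DC (a sketch building on the attribute you already configured, using your DC values):
cluster.routing.allocation.awareness.attributes: data_center
cluster.routing.allocation.awareness.force.data_center.values: DC1,DC2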
If you have limitations with respect to bandwidth it may be better to set up 2 separate clusters and index into both in parallel, e.g. from a message queue.
It makes sense. We did some testing simulating an outage in one of the DCs and it did work, but it is true that it didn't work well when the outage occurred in the secondary DC. In theory we have enough bandwidth for the replicas to work on a daily basis, and they do work in real time, meaning we create a primary shard in DC1 and have a replica in DC2 immediately; it also works the other way around. However, we've noticed a lot of bandwidth usage when one of the DCs has been unavailable and suddenly reconnects; that's when our DC-to-DC line gets saturated.
I assume this means that you have more master-eligible nodes in one of the DCs? If that is the case you can handle a single node failing, as well as the right DC failing (the one holding the minority of master-eligible nodes).
The default recovery speed is set quite low in order to not overwhelm nodes in the cluster, so if your bandwidth can not handle this when you have a large outage I think it could be a problem. You can naturally lower the limit, but be aware that this may increase recovery time and leave you exposed for longer.
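If you do lower it, it can also help to keep an eye on what is actually recovering while the DCs resync, for example with the cat recovery API (this just lists ongoing recoveries and their progress):
GET _cat/recovery?v&active_only=true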
Exactly! Sadly, I'm aware of the problem; that's why I'm asking if there's a way to limit this recovery only for the nodes in the other DC, without impacting the nodes in the same DC that don't have a bandwidth problem.
No, I do not think there is. It is generally not recommended to deploy Elasticsearch across DCs where there are latency or throughput constraints beyond what would exist in a single DC.
Thanks!! I was hoping for a better answer, but you are right on all the points you made. I will speak with our IT department, because I think they plan on upgrading the connection (but I'm not sure when), and in the meantime we can deploy a script that lowers the recovery speed only when a certain amount of bandwidth is being consumed.
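Something along these lines is probably what the script would end up calling whenever the line gets busy (just a sketch; the bandwidth check itself would have to live outside Elasticsearch, and 20mb is the value we were considering):
PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}
Setting the value back to null would restore the default once the link is quiet again.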