We have around 150 nodes (data size: 14 TB) in our Elasticsearch cluster and want to take snapshots of our data using an S3-compatible service.
We want to restrict the overall bandwidth used by the cluster for the snapshot process to 300 MBPS. The only way we found to do that was to determine the number of nodes ('x' nodes) that will actually participate in backing up the cluster, divide 300 by 'x', and set max_snapshot_bytes_per_sec to that value.
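To make the arithmetic concrete, here is a minimal sketch of the per-node calculation. It assumes "MBPS" means megabytes per second, uses 112 participating nodes (the figure from our 2 TB example) for 'x', and rounds down to whole kilobytes so the value has no fractional part. The function names and the bucket name are made up for illustration.

```python
def per_node_limit_mb(cluster_limit_mb: float, participating_nodes: int) -> float:
    """Split a cluster-wide bandwidth budget evenly across participating nodes."""
    if participating_nodes <= 0:
        raise ValueError("participating_nodes must be positive")
    return cluster_limit_mb / participating_nodes


def repository_settings(bucket: str, limit_mb: float) -> dict:
    """Build a repository definition carrying the per-node throttle.

    The limit is rounded down to whole kilobytes (an assumption made here
    to avoid a fractional size value).
    """
    limit_kb = int(limit_mb * 1024)
    return {
        "type": "s3",
        "settings": {
            "bucket": bucket,  # hypothetical bucket name supplied by the caller
            "max_snapshot_bytes_per_sec": f"{limit_kb}kb",
        },
    }


limit = per_node_limit_mb(300, 112)  # about 2.68 MB/s per node
body = repository_settings("my-snapshot-bucket", limit)
print(body["settings"]["max_snapshot_bytes_per_sec"])
```

The resulting dictionary is the kind of body we would send when registering the repository; the limit applies to each node separately, which is why the division by 'x' is needed.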
However, there is a downside to this setting: during the final stages of the snapshot, most of the nodes have finished pushing data and only a few nodes remain, making the final stage very slow. As an example, while taking a backup of around 2 TB of data (pushed from 112 nodes), the last 85 GB was pushed from 3-4 machines and took around 6-7 hours to finish because of the per-node bandwidth limit.
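The effect can be illustrated with a rough sketch, assuming the 300 budget was divided evenly across the 112 nodes and is measured in megabytes per second. The function name is hypothetical; it only shows how little of the cluster-wide budget is usable once a few nodes remain active.

```python
def tail_throughput_mb(cluster_limit_mb: float, total_nodes: int, active_nodes: int) -> float:
    """Aggregate throughput when only some nodes are still uploading.

    Each node is capped at cluster_limit_mb / total_nodes, so the cluster
    as a whole can only use active_nodes times that cap.
    """
    per_node = cluster_limit_mb / total_nodes
    return per_node * active_nodes


for active in (3, 4):
    rate = tail_throughput_mb(300, 112, active)
    print(f"{active} nodes: {rate:.1f} MB/s ({rate / 300:.1%} of budget)")
```

Under these assumptions the last 3-4 nodes together can push only about 8 to 11 MB/s, which is under 4% of the 300 budget, while the rest of the allowance sits unused.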
To avoid this, we removed max_snapshot_bytes_per_sec, set buffer_size to 8mb, set the retry count to 3 with throttling enabled, and applied a QPS limit of 40 at our S3-compatible service. Beyond 40 QPS, the service returns a 504 slowdown error. Now backups sometimes succeed, but at other times some shards are not backed up, and the error message for those shards shows 504 slowdown.
What is the best way to take backups in our case, given the constraint that we should not use more than 300 MBPS of network bandwidth?