[ES v7.5.1] Speed up snapshot restore from GCS Snapshot Repository

Hello,
We are trying to speed up our snapshot restoration speed in our ES cluster hosted on GCP Compute instances.

TL;DR:

  • Current Performance: 56 MBps per data node
  • Infra capable of 500 MBps per data node
  • We want to improve our restore speed up to the maximum disk throughput (no throttling from infrastructure).

Infra Details:

  • 3 Master | 10 Data Nodes
  • Each node has an 8-core / 16 GB config (heap size: 8 GB)
  • Each data instance supports max 15k disk IOPS (500 MBps throughput)

Current Restore Performance: 4.5 Gbps on the whole cluster (56 MBps per node).

We are currently getting 10% of the total disk write throughput. We are looking for options to improve it.

Note: We have already ruled out infra-level throttling. Using gsutil -m, we saw the download speed reach 450 MBps on one of the data nodes.
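For reference, the kind of download test described above can be run like this (bucket name and prefix are placeholders for your own snapshot bucket):

```shell
# Parallel download from GCS to gauge raw network/disk throughput on a node.
# -m enables parallel (multi-threaded/multi-process) transfers.
gsutil -m cp -r "gs://my-snapshot-bucket/some-prefix/*" /tmp/gcs-throughput-test/
```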

What we have already tried:

  1. Setting max_restore_bytes_per_sec to 0 in our GCS snapshot repository.
  2. Setting indices.recovery.max_bytes_per_sec to 0.

We are unable to figure out the config that is throttling the network performance.
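For context, both settings above can be applied through the APIs. A minimal sketch, assuming a repository named my_gcs_repo and a bucket my-snapshot-bucket (both placeholders; the "0" values mirror what we set in the thread):

```shell
# Repository-level restore rate limit (re-registering an existing
# repository with the same bucket keeps its snapshots).
curl -X PUT "localhost:9200/_snapshot/my_gcs_repo" \
  -H 'Content-Type: application/json' -d'
{
  "type": "gcs",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_restore_bytes_per_sec": "0"
  }
}'

# Cluster-wide recovery rate limit (a dynamic setting, no restart needed).
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "transient": { "indices.recovery.max_bytes_per_sec": "0" }
}'
```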


Update:

We tried increasing the number of data nodes, to check whether the throttling was happening per data node:

Changed Data nodes count from 10 to 20.
Result: Speed still throttled at 4.5 Gbps.


Hi Team,

Please note the config changes below that we have already tried without getting the expected speed.

At Cluster Level

indices.recovery.max_bytes_per_sec: "10gb"
indices.recovery.max_concurrent_file_chunks: 5

At Index Level:

"refresh_interval" : "-1",
"merge.scheduler.max_thread_count" : "1"

At Snapshot Repository Level:

"max_restore_bytes_per_sec": "10gb",
"max_snapshot_bytes_per_sec": "10gb"

If you're not already doing it, I would set the replicas (temporarily) to 0: Restore a snapshot | Elasticsearch Guide [7.13] | Elastic
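A sketch of that suggestion, with my-index as a placeholder index name:

```shell
# Temporarily drop replicas before the restore...
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 0 } }'

# ... run the restore ...

# ...then put the replica count back afterwards.
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' -d'
{ "index": { "number_of_replicas": 1 } }'
```

Alternatively, the restore request itself accepts an index_settings block, so the replica override can be applied as part of the restore rather than beforehand.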

Also I'm not sure how to read the "Infra Capable of 500 MBps". Isn't that what you're getting with "Current Restore Performance: 4.5 Gbps"?

Hi @xerrad Thanks for the reply...
For the "replicas (temporarily) to 0": Yes, we are setting it (but that's not helping with the speed).

"Infra Capable of 500 MBps" -> That is for one data node.

To clarify (Using Gbps for all the stats):

  • Per node, the infra is capable of 5 Gbps.
  • For 10 nodes, the infra is capable of 5 Gbps × 10 = 50 Gbps.
  • What we are getting for 10 nodes: 4.5 Gbps.

4.5 Gbps is over 10 nodes. We should be able to get about 50 Gbps over 10 nodes (our need is at least 25 Gbps).
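The unit conversion behind these figures (using decimal units, i.e. 1 Gbps = 125 MB/s) works out like this:

```shell
# Convert the aggregate restore throughput to a per-node MB/s figure.
total_gbps=4.5
nodes=10
# 1 Gbps = 1000 Mb/s; divide by 8 for MB/s, then by the node count.
awk -v g="$total_gbps" -v n="$nodes" \
  'BEGIN { printf "%.2f MB/s per node\n", g * 1000 / 8 / n }'
# → 56.25 MB/s per node
```

which matches the ~56 MBps per node quoted at the top of the thread.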

How many shards are you recovering per node? Have you tried increasing cluster.routing.allocation.node_concurrent_recoveries to increase the level of parallelism as described here?

Hi @Christian_Dahlqvist Thanks for replying.

We have a total of 32 shards in our index, and 10 nodes. So I can see that 3 shards land on each node (apart from 2 nodes, which have 4 shards).

We did try setting cluster.routing.allocation.node_concurrent_recoveries to 4 on the cluster (updated the elasticsearch.yml file on all nodes and restarted the service on all of them), but that didn't make any difference to the restore performance.
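As an aside, cluster.routing.allocation.node_concurrent_recoveries is a dynamic setting, so the same change can also be applied through the cluster settings API without editing elasticsearch.yml or restarting nodes; a sketch:

```shell
# Apply the recovery-parallelism setting dynamically, cluster-wide.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}'
```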

We also tried the below cluster configs (Just to see if it impacts speed):

  thread_pool.snapshot.core: 4
  thread_pool.snapshot.max: 8
  indices.recovery.max_concurrent_file_chunks: 5
  thread_pool.write.queue_size: 10000
  thread_pool.get.queue_size: 10000
  transport.connections_per_node.reg: 50
  transport.connections_per_node.bulk: 50
  thread_pool.fetch_shard_started.core: 4
  thread_pool.fetch_shard_store.core: 4
  cluster.routing.allocation.node_concurrent_recoveries: 4
  indices.query.bool.max_clause_count: 2048

But we didn't see any difference.

What is the size of the shards? Does GCS impose any performance limits for this type of data which includes a large number of sometimes small files?

Hi @Christian_Dahlqvist , each shard is about 100GB.

Update:
After trying multiple options, we weren't able to find the root cause of this.

We decided to use an updated version of ES (version: 7.10.2).

We are now able to reach 33 Gbps with the updated cluster.

Interestingly, we are using similar configs (similar Ansible playbooks) for both versions, but with version 7.10.2 we were able to reach the target speed.