[ES v6.7.2] Speed up snapshot restore from GCS Snapshot Repository

Hello Team,

We are trying to speed up snapshot restores in our ES cluster hosted on GCP Compute instances.

TL;DR :

  • Current Performance: 50 MBps per data node
  • Infra capable of 500 MBps per data node
  • We want to improve our restore speed up to the maximum disk throughput (no throttling from infrastructure).

Infra Details :

  • 5 Master | 7 Data | 2 Coordinating nodes
  • Each node has 16 cores / 32 GB RAM (heap size: 16 GB)
  • Each Data instance supports max 25k disk IOPS (500 MBps Throughput)

Current Restore Performance: roughly 3 Gbps across the whole cluster (50 MBps per data node).

We are currently getting 10% of the total disk write throughput. We are looking for options to improve it.
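
For reference, the figures above can be cross-checked with a little arithmetic. This is only a sketch: it assumes decimal units (1 MBps = 8 Mbps) and that all 7 data nodes restore at the same rate, and the variable names are made up for illustration.

```python
# Cross-check of the throughput figures quoted in the post.
DATA_NODES = 7
PER_NODE_MBPS = 50        # observed restore rate per data node (MB/s)
DISK_LIMIT_MBPS = 500     # max disk throughput per data node (MB/s)

cluster_mbps = DATA_NODES * PER_NODE_MBPS    # aggregate rate in MB/s
cluster_gbps = cluster_mbps * 8 / 1000       # same rate in Gbit/s
utilisation = PER_NODE_MBPS / DISK_LIMIT_MBPS

print(f"cluster restore rate: {cluster_mbps} MB/s = {cluster_gbps:.1f} Gbps")
print(f"disk utilisation per node: {utilisation:.0%}")
```

Under those assumptions the aggregate comes out at 2.8 Gbps, which matches the "3 Gbps" figure after rounding.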

What we have already tried:

  • cluster.routing.allocation.node_concurrent_recoveries: 30
  • cluster.routing.allocation.node_initial_primaries_recoveries: 30
  • indices.recovery.max_bytes_per_sec: 20gb
  • indices.recovery.max_concurrent_file_chunks: 5
  • thread_pool.bulk.queue_size: 2000
  • thread_pool.bulk.size: 16
  • thread_pool.index.queue_size: 2000
  • thread_pool.index.size: 16
  • thread_pool.snapshot.core: 10
  • thread_pool.snapshot.max: 50
  • transport.connections_per_node.recovery: 10

We also tested the restore speed on the following ES versions:

  • 6.8.12 - No improvement in the speed.
  • 7.10.2 - Massive improvement in the speed: 28 Gbps for the 7 data node cluster (about 500 MBps per node, i.e. the disk limit).

We are unable to figure out which config is throttling the restore throughput on 6.x.

This is the best solution IMO: upgrade to a newer version. 6.7 is nearly 3 years old and well over a year past EOL, so it's no longer supported and you're missing out on several years of performance improvements by using such an ancient version.

You should restore these values back to the defaults. The values you have set can make performance worse and can even lead to cluster instability. Increasing indices.recovery.max_bytes_per_sec is acceptable, but the value you use should not exceed your actual disk throughput.
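
A minimal sketch of what resetting to defaults could look like, assuming the recovery and allocation values were applied as transient settings through the cluster update settings API, where a null value removes the override. The helper name `build_reset_body` is hypothetical. Settings that were placed in elasticsearch.yml (the thread_pool.* ones are typically static node settings) would instead need to be removed from that file, followed by a node restart.

```python
import json

# Settings from the original post that are assumed here to be
# dynamic cluster settings (an assumption, verify for your version).
DYNAMIC_SETTINGS = [
    "cluster.routing.allocation.node_concurrent_recoveries",
    "cluster.routing.allocation.node_initial_primaries_recoveries",
    "indices.recovery.max_bytes_per_sec",
    "indices.recovery.max_concurrent_file_chunks",
]


def build_reset_body(keys, scope="transient"):
    """Build a body for PUT _cluster/settings.

    Each key maps to None, which serializes to JSON null and
    asks Elasticsearch to drop the override for that setting.
    """
    return {scope: {key: None for key in keys}}


body = build_reset_body(DYNAMIC_SETTINGS)
print("PUT _cluster/settings")
print(json.dumps(body, indent=2))
```

If the overrides were stored as persistent settings, the same body with `scope="persistent"` would apply.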

I forget the details of snapshotting in such old versions but in newer versions the repository setting max_restore_bytes_per_sec defaults to 40mb so it might help to increase this too. Even if it does help, you should still upgrade as a matter of some urgency.
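
As a rough illustration of the repository setting mentioned above, the sketch below builds the request body for re-registering a GCS repository with a higher max_restore_bytes_per_sec. The repository name, bucket name and the 500mb value are placeholders chosen for this example (500mb simply mirrors the per-node disk limit from the question), and whether this setting is the actual bottleneck on 6.7 is not confirmed in this thread.

```python
import json


def build_repo_body(bucket, max_restore_rate):
    """Build a body for PUT _snapshot/<repository> for a GCS repository."""
    return {
        "type": "gcs",
        "settings": {
            "bucket": bucket,
            # Per-node restore throttle for this repository.
            "max_restore_bytes_per_sec": max_restore_rate,
        },
    }


REPO_NAME = "my_gcs_repo"  # hypothetical repository name
body = build_repo_body("my-snapshot-bucket", "500mb")  # hypothetical bucket
print(f"PUT _snapshot/{REPO_NAME}")
print(json.dumps(body, indent=2))
```

Re-registering a repository is expected to replace its settings as a whole, so any other settings the repository already uses (for example a client or base path) should be carried over into the body.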

Thank you for the updates.

We will restore the values back to the defaults.
