Hello Team,
We are trying to improve snapshot restore speed on our ES cluster hosted on GCP Compute instances.
TL;DR:
- Current performance: 50 MBps per data node
- Infra is capable of 500 MBps per data node
- We want to raise restore speed to the maximum disk throughput (there is no throttling from the infrastructure).
Infra details:
- 5 master | 7 data | 2 coordinating nodes
- Each node has 16 cores and 32 GB RAM (heap size: 16 GB)
- Each data instance supports a maximum of 25k disk IOPS (500 MBps throughput)
Current restore performance: about 3 Gbps across the whole cluster (50 MBps per data node).
That is only about 10% of the disk write throughput available on each data node. We are looking for options to improve it.
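For anyone checking the numbers above, here is a small sketch of the arithmetic. It assumes decimal units (1 Gb = 1000 Mb, 1 byte = 8 bits) and that "MBps" means megabytes per second while "Gbps" means gigabits per second; the function name is only illustrative.

```python
# Figures taken from this post.
DATA_NODES = 7
PER_NODE_MBPS = 50      # observed restore rate per data node, megabytes/s
DISK_LIMIT_MBPS = 500   # per-node disk throughput ceiling, megabytes/s


def cluster_gbps(per_node_mbps, nodes):
    """Convert a per-node rate in megabytes/s to a cluster-wide rate in gigabits/s."""
    return per_node_mbps * nodes * 8 / 1000


# Aggregate restore rate across all data nodes.
current_cluster_gbps = cluster_gbps(PER_NODE_MBPS, DATA_NODES)

# Fraction of the per-node disk ceiling actually used during restore.
utilization = PER_NODE_MBPS / DISK_LIMIT_MBPS

print(f"cluster restore rate: {current_cluster_gbps:.1f} Gbps")
print(f"disk utilization per node: {utilization:.0%}")
```

Under these assumptions the aggregate works out to 2.8 Gbps, which is the "about 3 Gbps" quoted above.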
What we have already tried:
- cluster.routing.allocation.node_concurrent_recoveries: 30
- cluster.routing.allocation.node_initial_primaries_recoveries: 30
- indices.recovery.max_bytes_per_sec: 20gb
- indices.recovery.max_concurrent_file_chunks: 5
- thread_pool.bulk.queue_size: 2000
- thread_pool.bulk.size: 16
- thread_pool.index.queue_size: 2000
- thread_pool.index.size: 16
- thread_pool.snapshot.core: 10
- thread_pool.snapshot.max: 50
- transport.connections_per_node.recovery: 10
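To make the list above easier to reproduce, here is a rough sketch of how the recovery-related settings could be packaged as a body for `PUT _cluster/settings`. It assumes that the four `cluster.routing.allocation.*` and `indices.recovery.*` settings are dynamic in the version being tested, and that the `thread_pool.*` and `transport.connections_per_node.recovery` settings are static (set in `elasticsearch.yml`, node restart required); please verify both assumptions against the documentation for your version. The helper name `build_cluster_settings_body` is hypothetical.

```python
import json

# Settings from the list above that are assumed to be dynamically updatable.
# Assumption: the remaining thread_pool.* and transport.* settings are static
# and belong in elasticsearch.yml instead.
dynamic_settings = {
    "cluster.routing.allocation.node_concurrent_recoveries": 30,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 30,
    "indices.recovery.max_bytes_per_sec": "20gb",
    "indices.recovery.max_concurrent_file_chunks": 5,
}


def build_cluster_settings_body(settings, persistent=False):
    """Wrap flat settings in the 'transient' or 'persistent' scope of the request body."""
    scope = "persistent" if persistent else "transient"
    return {scope: dict(settings)}


# Transient scope: the values do not survive a full cluster restart.
body = build_cluster_settings_body(dynamic_settings)
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent as the request body; transient scope is used here so that the experiment does not leave permanent changes behind.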
We tested the restore speed on the following ES versions:
- 6.8.12: No improvement in the speed.
- 7.10.2: Massive improvement in the speed. 28 Gbps for the 7 data node cluster.
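As a sanity check on the 7.10.2 figure, the cluster-wide rate can be converted back to a per-node rate. This sketch assumes "Gbps" means gigabits per second, decimal units, and an even spread across the 7 data nodes; the function name is only illustrative.

```python
def per_node_mbps(cluster_gbps, nodes):
    """Convert a cluster-wide rate in gigabits/s to a per-node rate in megabytes/s."""
    return cluster_gbps * 1000 / 8 / nodes


# 28 Gbps across 7 data nodes, as measured on 7.10.2.
rate_7_10_2 = per_node_mbps(28, 7)
print(f"{rate_7_10_2:.0f} MBps per data node")
```

Under those assumptions this comes to 500 MBps per data node, which matches the disk throughput limit listed in the infra details.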
We are unable to figure out which config setting is throttling the network performance.