Hi there,
I am looking for some advice on improving the performance of restoring a large backup from a GCS repository. In my test, I am restoring about 400TB of data from a GCS bucket into a 50-node Elasticsearch cluster running 7.17.
Each node has a 16-core CPU and 64GB of memory, and is equipped with a high-throughput local SSD (up to 1,400MB/s) and a 10Gbps (1,250MB/s) network interface. Heap size is 31GB.
I made the following configuration changes before issuing the restore operation:

- Set the static setting (per node, in elasticsearch.yml):

  thread_pool.snapshot.max: 15

- Set the dynamic settings:

  "cluster" : {
    "routing" : {
      "allocation" : {
        "node_initial_primaries_recoveries" : "12",
        "enable" : "all",
        "cluster_concurrent_rebalance" : "20",
        "node_concurrent_recoveries" : "20"
      }
    }
  },
  "indices" : {
    "recovery" : {
      "max_concurrent_operations" : "4",       // max allowed
      "max_bytes_per_sec" : "2500mb",          // tried to increase higher, not helping
      "max_concurrent_file_chunks" : "8"       // max allowed
    }
  }
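For reference, the dynamic settings were pushed through the cluster settings API, roughly like this (a sketch; `localhost:9200` stands in for any node in the cluster, and persistent vs. transient is whatever fits your setup):

```
# Sketch: apply the dynamic settings above via the cluster settings API.
# "localhost:9200" is a placeholder for any node in the cluster.
curl -s -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": "12",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "20",
    "cluster.routing.allocation.node_concurrent_recoveries": "20",
    "indices.recovery.max_concurrent_operations": "4",
    "indices.recovery.max_bytes_per_sec": "2500mb",
    "indices.recovery.max_concurrent_file_chunks": "8"
  }
}'
```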
Snapshot thread pool usage on one node looks like this; CPU usage on the nodes is low, around 4%:
"DTeyUtAzRyCLJ10rR274uw.thread_pool.snapshot.threads": 15,
"DTeyUtAzRyCLJ10rR274uw.thread_pool.snapshot.queue": 88,
"DTeyUtAzRyCLJ10rR274uw.thread_pool.snapshot.active": 15,
"DTeyUtAzRyCLJ10rR274uw.thread_pool.snapshot.rejected": 0,
"DTeyUtAzRyCLJ10rR274uw.thread_pool.snapshot.largest": 15,
What I see in the monitoring graphs is that each node pulls up to about 400MB/s from the GCS bucket.
I checked on the bucket side and did not see any threshold being reached.
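Per-shard restore progress and throughput can also be watched with the recovery API while this is running, e.g. something like:

```
# Sketch: watch active per-shard snapshot recoveries during the restore.
curl -s "localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,type,stage,bytes_recovered,bytes_percent,time"
```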
With the restore in progress, I logged onto one of the Elasticsearch nodes and ran gsutil to copy files directly from the bucket where the snapshot files are stored to the node's local SSD, and the write throughput went up to the 600MB/s range. So it sounds like Elasticsearch should be able to pull more if it could.
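The ad-hoc copy test was essentially a parallel gsutil download along these lines; the bucket path and target directory below are placeholders:

```
# Sketch: parallel copy from the snapshot bucket to the node's local SSD.
# gs://my-snapshot-bucket/my-repo-base-path and /mnt/local-ssd/gsutil-test are placeholders.
gsutil -m cp -r "gs://my-snapshot-bucket/my-repo-base-path/indices/*" /mnt/local-ssd/gsutil-test/
```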
The indices being restored have shards of about 30GB each, and each index has about 2 shards, so the restore involves on the order of 13,000 shards in total (400TB / ~30GB per shard).
Any suggestions?