After moving from a 2.4.6 cluster to a 6.3.0 cluster we have noticed that shard recovery is a lot slower. On the 2.4.6 cluster a shard recovers in ~3 minutes; on the 6.3.0 cluster the same shard takes ~9-11 minutes. The clusters are identical in topology, JVM, OS, data, and index sizes. I am trying to figure out why the shard recovery times are so different between the two versions.
Settings on 2.4.6
cluster.routing.allocation.node_initial_primaries_recoveries: 5
cluster.routing.allocation.node_concurrent_recoveries: 5
indices.recovery.max_bytes_per_sec: 400mb
indices.recovery.concurrent_streams: 5
Settings on 6.3.0
cluster.routing.allocation.node_initial_primaries_recoveries: 5
cluster.routing.allocation.node_concurrent_incoming_recoveries: 5
cluster.routing.allocation.node_concurrent_outgoing_recoveries: 5
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
indices.recovery.max_bytes_per_sec: 400mb
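For reference, here is roughly how I double-check the effective values on 6.3.0 and how the dynamic ones can be overridden at runtime. This is only a sketch: localhost:9200 stands in for one of our nodes, and the transient update just re-applies the same values for illustration.

# Show effective settings (including defaults) to confirm what 6.3.0 is actually using;
# filter the output down to the recovery/allocation settings
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' \
  | grep -E 'indices\.recovery|cluster\.routing\.allocation'

# The same dynamic settings can also be applied via the cluster settings API
curl -s -X PUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "400mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 5
  }
}'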
Index Information
1 index, 5 primary shards, 1 replica
Each shard is ~9 GB
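Shard sizes can be confirmed with _cat/shards, e.g. (index name and host are placeholders):

# Store size per shard for the index in question
curl -s 'localhost:9200/my_index/_cat/shards?v&h=index,shard,prirep,state,store'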
I have taken a look at the _recovery API output on the 6.3.0 cluster and nothing stands out to me. Here is a representative shard (the request I used and a rough throughput calculation follow after the output):
"index": {
"size": {
"total": "8.9gb",
"total_in_bytes": 9587011676,
"reused": "0b",
"reused_in_bytes": 0,
"recovered": "8.9gb",
"recovered_in_bytes": 9587011676,
"percent": "100.0%"
},
"files": {
"total": 107,
"reused": 0,
"recovered": 107,
"percent": "100.0%"
},
"total_time": "9.3m",
"total_time_in_millis": 560295,
"source_throttle_time": "0s",
"source_throttle_time_in_millis": 0,
"target_throttle_time": "0s",
"target_throttle_time_in_millis": 0
},
"translog": {
"recovered": 33530,
"total": 33530,
"percent": "100.0%",
"total_on_start": 33512,
"total_time": "2.3s",
"total_time_in_millis": 2335
},
"verify_index": {
"check_index_time": "0s",
"check_index_time_in_millis": 0,
"total_time": "0s",
"total_time_in_millis": 0
}
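The recovery details above come from something like the following (index name and host are placeholders):

# Per-shard recovery details with human-readable units and per-file breakdown
curl -s 'localhost:9200/my_index/_recovery?human&detailed=true&pretty'

For what it's worth, 9,587,011,676 bytes in 560,295 ms works out to roughly 17 MB/s on 6.3.0, versus roughly 50 MB/s on 2.4.6 (~9 GB in ~3 minutes), so the transfer is well under the configured 400mb indices.recovery.max_bytes_per_sec and both source and target throttle times are 0s.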