Hi,
The Elasticsearch cluster has 6 hot nodes and 4 cold nodes. One cold node was removed because of a hardware failure, so a large number of missing replica shards (about 20TB) began to recover. However, the recovery was very slow and lasted about 4 days.
Take one shard recovery as an example: the recovery of a 49.9gb shard had been running for more than 10 hours and had recovered only 46.0% of the bytes:
{
"id": 2,
"type": "PEER",
"stage": "INDEX",
"primary": false,
"start_time": "2023-12-26T00:59:05.021Z",
"start_time_in_millis": 1703552345021,
"total_time": "10.7h",
"total_time_in_millis": 38671947,
"source": {
"id": "ixxUhwHFSqKjcVm_QkprxA",
"host": "13.50.32.24",
"transport_address": "13.50.32.24:9301",
"ip": "13.50.32.24",
"name": "node-2-cold"
},
"target": {
"id": "rS7gKUTFRNyJfpRhnzjmCg",
"host": "13.50.32.25",
"transport_address": "13.50.32.25:9301",
"ip": "13.50.32.25",
"name": "node-3-cold"
},
"index": {
"size": {
"total": "49.9gb",
"total_in_bytes": 53665251764,
"reused": "0b",
"reused_in_bytes": 0,
"recovered": "22.9gb",
"recovered_in_bytes": 24670228482,
"recovered_from_snapshot": "0b",
"recovered_from_snapshot_in_bytes": 0,
"percent": "46.0%"
},
"files": {
"total": 238,
"reused": 0,
"recovered": 221,
"percent": "92.9%"
},
"total_time": "10.7h",
"total_time_in_millis": 38662530,
"source_throttle_time": "31s",
"source_throttle_time_in_millis": 31097,
"target_throttle_time": "30.7m",
"target_throttle_time_in_millis": 1843266
},
"translog": {
"recovered": 0,
"total": 0,
"percent": "100.0%",
"total_on_start": 0,
"total_time": "0s",
"total_time_in_millis": 0
},
"verify_index": {
"check_index_time": "0s",
"check_index_time_in_millis": 0,
"total_time": "0s",
"total_time_in_millis": 0
}
}
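To put the output above in perspective, here is a rough back-of-the-envelope calculation of the effective transfer rate. It is only a sketch: the variable names are chosen for illustration, and the values are copied from the "index" section of the recovery output.

```python
# Values copied from the "index" section of the recovery output above.
recovered_in_bytes = 24670228482
total_in_bytes = 53665251764
total_time_in_millis = 38662530
target_throttle_time_in_millis = 1843266

elapsed_s = total_time_in_millis / 1000

# Average transfer rate over the whole file-copy phase so far.
rate_bytes_s = recovered_in_bytes / elapsed_s
rate_mib_s = rate_bytes_s / 2**20

# Same calculation, ignoring the time the target spent throttled.
unthrottled_s = (total_time_in_millis - target_throttle_time_in_millis) / 1000
rate_unthrottled_mib_s = recovered_in_bytes / unthrottled_s / 2**20

# Hours still needed for this shard if the average rate holds.
remaining_h = (total_in_bytes - recovered_in_bytes) / rate_bytes_s / 3600

print(f"average rate:              {rate_mib_s:.2f} MiB/s")
print(f"rate excluding throttling: {rate_unthrottled_mib_s:.2f} MiB/s")
print(f"estimated time remaining:  {remaining_h:.1f} h")
```

The average works out to roughly 0.6 MiB/s, and it barely changes when the 30.7 minutes of target throttle time are excluded, so throttling alone does not seem to explain the slowness.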
I tried adjusting the following settings, but it made no difference:
"cluster.routing.allocation.node_concurrent_recoveries": 50
"indices.recovery.max_bytes_per_sec" : "1000mb"
No matter how I changed the settings, the recovery didn't speed up.
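For anyone who wants to reproduce this, below is a minimal sketch of how these two settings can be applied through the cluster update settings API (`PUT /_cluster/settings`). The URL, the lack of authentication, the choice of `persistent` rather than `transient`, and the helper names are all assumptions for illustration, not necessarily how the settings were applied on this cluster.

```python
import json
import urllib.request

# Assumption: the cluster is reachable at this address without authentication.
ES_URL = "http://localhost:9200"


def build_recovery_settings(concurrent_recoveries=50, max_bytes_per_sec="1000mb"):
    """Build the request body for PUT /_cluster/settings (hypothetical helper)."""
    return {
        # Assumption: "persistent" is used here; "transient" would also work.
        "persistent": {
            "cluster.routing.allocation.node_concurrent_recoveries": concurrent_recoveries,
            "indices.recovery.max_bytes_per_sec": max_bytes_per_sec,
        }
    }


def apply_settings(body, es_url=ES_URL):
    """Send the settings update and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{es_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


body = build_recovery_settings()
print(json.dumps(body, indent=2))
# apply_settings(body)  # uncomment to send the update to the cluster
```

Both settings are dynamic, so they take effect without restarting any node.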
The Elasticsearch version is 7.16. The cold node hardware is:
cpu: 64 cores
memory: 180GB
storage: RAID 0 (10 x 7.3TB HDD)
network: 10GB*2
I am quite sure the host load is very low.
Any ideas? Thank you.
I saw a similar issue, "Elasticsearch 6.3.0 shard recovery is slow", where setting transport.tcp.compress to false fixed the problem.