I'm trying to get a good grasp on how long I can expect a replica shard to take to transfer between nodes.
Running ES 6.2 on Windows, I ran a few tests.
First I used the "/_cluster/reroute" API to get a baseline for how long it takes to move a shard from one node to another. I picked an index that was no longer being written to, selected a replica shard on a particular node, and rerouted it to another node (making sure no other shards were moving around at the same time). When I did this, 8 GB shards consistently took about 8 minutes. I used the recovery API to monitor the relocation (the exact calls I used are sketched further down); here's a snapshot of that, mid transfer:
"indexname": {
"shards": [
{
"id": 61,
"type": "PEER",
"stage": "INDEX",
"primary": false,
"start_time_in_millis": 1528993584252,
"total_time_in_millis": 384957,
"source": {
"id": "IeDiGU9LStCvP0rqzwhLJg",
"host": "shost",
"transport_address": "1.1.1.1:9300",
"ip": "1.1.1.1",
"name": "shost"
},
"target": {
"id": "vH5lZXG7T_SargmNTQc6xQ",
"host": "thost",
"transport_address": "2.2.2.2:9300",
"ip": "2.2.2.2",
"name": "thost"
},
"index": {
"size": {
"total_in_bytes": 8268932538,
"reused_in_bytes": 0,
"recovered_in_bytes": 6892110768,
"percent": "83.3%"
},
"files": {
"total": 118,
"reused": 0,
"recovered": 117,
"percent": "99.2%"
},
"total_time_in_millis": 384939,
"source_throttle_time_in_millis": 0,
"target_throttle_time_in_millis": 0
},
"translog": {
"recovered": 0,
"total": 0,
"percent": "100.0%",
"total_on_start": 0,
"total_time_in_millis": 0
},
"verify_index": {
"check_index_time_in_millis": 0,
"total_time_in_millis": 0
}
}
I'm not seeing any throttling, and everything appears to go smoothly (though slowly). The file count hits 99% fairly quickly and the rest of the time goes to the last few (largest) files, but that makes sense. When I move a smaller shard, it stays close to that same 1 GB/minute rate.
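For reference, the calls I'm using look roughly like this (the index, shard, and node names are placeholders for our real ones):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "indexname",
        "shard": 61,
        "from_node": "shost",
        "to_node": "thost"
      }
    }
  ]
}

GET /indexname/_recovery?human&detailed=true

The recovery output above is from the second call while the move was in flight.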
However, I know that our hardware can handle much faster data transfer (fast network, SSD drives). To get an idea of the fastest transfer our hardware could handle, I manually copied data between the machines, and that same 8 GB took under 30 seconds, i.e. well over 250 MB/s versus the roughly 17 MB/s the recovery is achieving.
I do expect there to be some overhead with ES, but this seems extreme.
To get to the point where we are actually throttled by the "indices.recovery.max_bytes_per_sec" setting, I have to bring it down below the speeds we are already seeing, i.e. below roughly 20 MB/s. Setting it any higher than that has no effect, so something else is slowing the transfer down. I know that setting is the obvious first thing to check, but I think I have eliminated it as a factor.
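For what it's worth, this is roughly how I've been adjusting that setting while testing; the 500mb value here is just an example to rule the cap out (the default is 40mb):

PUT /_cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "500mb"
  }
}

Raising it like this makes no difference; only lowering it below ~20 MB/s changes the observed speed.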
I'm hoping there are some ES settings I simply haven't found yet. I've searched around, but most of what I turn up are settings that have since been deprecated. Any suggestions would be appreciated.
As we are built today, overall cluster performance is... okay (though it would improve considerably if we could make better use of our hardware). We expect groups of nodes to go down on a semi-regular basis, and when a node comes back it takes about 30 minutes to transfer over its share of shards, which is not ideal. What's more, our shards are small on average compared to what I read about other people's clusters. I would really like to decrease the shard counts and have bigger shards, but I'm worried about what would happen to our stability if individual shards started taking 20 minutes to move between nodes (I know about raising the concurrency settings, sketched below, but that does not improve the minimum recovery time for a single shard).
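For completeness, these are the concurrency-related settings I'm referring to; the values shown are just illustrative, not what we actually run:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4
  }
}

Those let more shards move in parallel, but each individual shard still recovers at the same ~1 GB/minute rate, so they don't help the single-shard worst case.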