Elasticsearch 7.16 shard recovery slow

hi,

The Elasticsearch cluster has 6 hot nodes and 4 cold nodes. One cold node was removed because of a hardware failure, so a lot of missing replica shards (about 20TB) began to recover. But I found the recovery process was very slow and lasted about 4 days.

Take one shard recovery as an example: a 49.9gb shard recovery lasted for more than 10 hours.
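
For reference, stats like the JSON below can be pulled from the recovery APIs; a minimal sketch (localhost:9200 and <index> are placeholders for one of my nodes and the affected index):

curl -s 'localhost:9200/<index>/_recovery?human&pretty'
curl -s 'localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent'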

{
  "id": 2,
  "type": "PEER",
  "stage": "INDEX",
  "primary": false,
  "start_time": "2023-12-26T00:59:05.021Z",
  "start_time_in_millis": 1703552345021,
  "total_time": "10.7h",
  "total_time_in_millis": 38671947,
  "source": { - 
    "id": "ixxUhwHFSqKjcVm_QkprxA",
    "host": "13.50.32.24",
    "transport_address": "13.50.32.24:9301",
    "ip": "13.50.32.24",
    "name": "node-2-cold"
  },
  "target": { - 
    "id": "rS7gKUTFRNyJfpRhnzjmCg",
    "host": "13.50.32.25",
    "transport_address": "13.50.32.25:9301",
    "ip": "13.50.32.25",
    "name": "node-3-cold"
  },
  "index": { - 
    "size": { - 
      "total": "49.9gb",
      "total_in_bytes": 53665251764,
      "reused": "0b",
      "reused_in_bytes": 0,
      "recovered": "22.9gb",
      "recovered_in_bytes": 24670228482,
      "recovered_from_snapshot": "0b",
      "recovered_from_snapshot_in_bytes": 0,
      "percent": "46.0%"
    },
    "files": { - 
      "total": 238,
      "reused": 0,
      "recovered": 221,
      "percent": "92.9%"
    },
    "total_time": "10.7h",
    "total_time_in_millis": 38662530,
    "source_throttle_time": "31s",
    "source_throttle_time_in_millis": 31097,
    "target_throttle_time": "30.7m",
    "target_throttle_time_in_millis": 1843266
  },
  "translog": { - 
    "recovered": 0,
    "total": 0,
    "percent": "100.0%",
    "total_on_start": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  },
  "verify_index": { - 
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}

I tried to adjust the following settings, but it didn't help:

"cluster.routing.allocation.node_concurrent_recoveries": 50
"indices.recovery.max_bytes_per_sec" : "1000mb"

No matter how I changed the settings, the recovery didn't speed up.

Elasticsearch version 7.16. The cold node hardware:

cpu: 64 cores
memory: 180GB
storage: RAID0 (7.3TB HDD * 10)
network: 10GB * 2

I am quite sure the host load is very low.

Any ideas? Thank you.

I saw a similar issue, Elasticsearch 6.3.0 shard recovery is slow, where setting transport.tcp.compress to false fixed the problem.
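
If transport compression were the suspect here, the currently configured value could be checked with the node info API; a sketch, assuming the default HTTP port (only explicitly configured settings show up in the output):

curl -s 'localhost:9200/_nodes/settings?filter_path=nodes.*.settings.transport&pretty'

As far as I know this is a static setting (named transport.compress in 7.x), so changing it means editing elasticsearch.yml and restarting the nodes.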

I suspect this is the problem: HDDs are just not very fast. They're particularly bad at concurrent IO, so node_concurrent_recoveries: 50 is going to make things a whole lot worse.


Thank you for your quick reply.

I updated the settings:

"cluster.routing.allocation.node_concurrent_recoveries": 2
"indices.recovery.max_bytes_per_sec" : "1000mb"

but shard recovery was still slow.

{
  "id": 0,
  "type": "PEER",
  "stage": "INDEX",
  "primary": true,
  "start_time": "2023-12-27T04:56:29.122Z",
  "start_time_in_millis": 1703652989122,
  "total_time": "37.6m",
  "total_time_in_millis": 2258017,
  "source": { - 
    "id": "Tlqnxqj4T9ey359n2jt4Bw",
    "host": "13.50.32.23",
    "transport_address": "13.50.32.23:9300",
    "ip": "13.50.32.23",
    "name": "node-1"
  },
  "target": { - 
    "id": "ixxUhwHFSqKjcVm_QkprxA",
    "host": "13.50.32.24",
    "transport_address": "13.50.32.24:9301",
    "ip": "13.50.32.24",
    "name": "node-2-cold"
  },
  "index": { - 
    "size": { - 
      "total": "51gb",
      "total_in_bytes": 54843923353,
      "reused": "0b",
      "reused_in_bytes": 0,
      "recovered": "50.4gb",
      "recovered_in_bytes": 54123145777,
      "recovered_from_snapshot": "0b",
      "recovered_from_snapshot_in_bytes": 0,
      "percent": "98.7%"
    },
    "files": { - 
      "total": 208,
      "reused": 0,
      "recovered": 207,
      "percent": "99.5%"
    },
    "total_time": "37.6m",
    "total_time_in_millis": 2257942,
    "source_throttle_time": "201.6ms",
    "source_throttle_time_in_millis": 201,
    "target_throttle_time": "1s",
    "target_throttle_time_in_millis": 1043
  },
  "translog": { - 
    "recovered": 0,
    "total": 0,
    "percent": "100.0%",
    "total_on_start": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  },
  "verify_index": { - 
    "check_index_time": "0s",
    "check_index_time_in_millis": 0,
    "total_time": "0s",
    "total_time_in_millis": 0
  }
}

Recovering this 51GB shard took about 37.6 minutes (roughly 24 MB/s).

What does await and disk utilisation look like on the cold nodes while recovery is ongoing?
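
For example (a sketch, assuming the sysstat package is installed), sampling extended per-device stats every 5 seconds while a recovery is running:

iostat -xm 5

The columns of interest are %util and await (r_await/w_await on newer sysstat versions).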

Here is a snapshot of the iostat output at its busiest; the disks are almost idle the rest of the time.

sdk to sdj are the HDD disks.
md1 is the RAID0 array built from these HDDs.

hi,
I found the root cause.

The transport network between the nodes has a bandwidth of only 1GB, and the recovery traffic was hitting that limit.
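
Assuming that 1GB refers to a 1 Gbit/s link, its ceiling is about 1 Gbit/s ÷ 8 ≈ 125 MB/s before protocol overhead, far below the 1000mb indices.recovery.max_bytes_per_sec limit I had configured, so the network rather than the disks or the throttle settings was the real bottleneck.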

Sorry to interrupt.

20TiB in 4 days is just 60MiB/s.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.