It was fixed after we shut down the node being decommissioned, which caused the replica to be promoted to primary.
Some more background on our experience:
The reason we wanted to decommission the node was an AWS notification about a potential hardware failure. I believe shard relocation failed to finish because of a disk drive failure: I also hit an I/O termination failure message when using scp to copy a particular file off the failing AWS node. This issue could have been avoided if the primary shard had been switched to the replica as the first step.
We will upgrade to the latest version ASAP, but please consider this suggestion for future releases if it has not been implemented already.
Once a node is marked as "exclude", the cluster should immediately move all primary shards off that node.
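For anyone following along, here is a rough Python sketch of what I mean. It assumes the node was excluded by IP through the `cluster.routing.allocation.exclude._ip` setting and that shard rows come from `_cat/shards` in JSON form with `index`, `shard`, `prirep` and `ip` fields. The helper names, the sample IP and the sample rows are made up for illustration; no request is sent to a cluster.

```python
import json


def exclude_node_settings(ip):
    """Build a body for PUT _cluster/settings that excludes a node by IP."""
    return {"transient": {"cluster.routing.allocation.exclude._ip": ip}}


def primaries_still_on(shard_rows, ip):
    """Return (index, shard) pairs whose primary copy still sits on the given IP."""
    return [
        (row["index"], row["shard"])
        for row in shard_rows
        if row["prirep"] == "p" and row["ip"] == ip
    ]


# Hypothetical rows shaped like _cat/shards JSON output (not real cluster data).
sample_rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "ip": "10.0.0.1"},
    {"index": "logs", "shard": "0", "prirep": "r", "ip": "10.0.0.2"},
    {"index": "logs", "shard": "1", "prirep": "r", "ip": "10.0.0.1"},
    {"index": "logs", "shard": "1", "prirep": "p", "ip": "10.0.0.2"},
]

print(json.dumps(exclude_node_settings("10.0.0.1")))
print(primaries_still_on(sample_rows, "10.0.0.1"))
```

Any primary this check reports is one that still depends on the excluded (possibly failing) node, which is the situation we ran into.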
linked topic: Weird rebalancing strategy - #3 by linkerc