We have a cluster of 8 hot nodes and 32 warm nodes, and a process that automatically moves old indices from hot to warm.
This process worked well until this week, when it moved index-2019.05.12.
index-2019.05.12 has 40 primary shards and 1 replica. When we checked, 39 shards were green, but the replica of shard 0 was stuck in RELOCATING.
_cat/shards shows:
index-2019.05.12 0 r RELOCATING 17332920 28.8gb 10.50.50.84 host8-0 -> 10.50.50.201 eF81IZspSDSVbfH0a0fNkw host20-3
_cat/recovery shows:
index-2019.05.12 0 11.1m peer index elasticsearch-host10-b host10-1 elasticsearch-host20-d host20-3 n/a n/a 15 0 0.0% 15 30982519357 0 0.0% 30982519357 0 0 100.0%
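Notice the inconsistency between the two outputs: _cat/shards shows the replica relocating from host8-0 to host20-3, while _cat/recovery shows host10-1 as the recovery source for host20-3, with 0 bytes recovered after 11.1 minutes.
host8-0 is a hot node, while host10-1 and host20-3 are warm nodes.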
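To make the mismatch easier to see, here is a small Python sketch that parses the two lines pasted above and compares the source nodes. It assumes both outputs use the default column order without headers; the function names are only illustrative and are not part of any Elasticsearch client.

```python
# Lines copied from the _cat output above. Column positions are an
# assumption: default column order, no header row.
SHARDS_LINE = (
    "index-2019.05.12 0 r RELOCATING 17332920 28.8gb "
    "10.50.50.84 host8-0 -> 10.50.50.201 eF81IZspSDSVbfH0a0fNkw host20-3"
)
RECOVERY_LINE = (
    "index-2019.05.12 0 11.1m peer index elasticsearch-host10-b host10-1 "
    "elasticsearch-host20-d host20-3 n/a n/a 15 0 0.0% 15 "
    "30982519357 0 0.0% 30982519357 0 0 100.0%"
)


def parse_relocating_shard(line):
    """Return (source_node, target_node) from a RELOCATING _cat/shards row."""
    left, right = line.split(" -> ")
    # The node name is the last field on each side of the arrow.
    return left.split()[-1], right.split()[-1]


def parse_recovery(line):
    """Return the fields of interest from a _cat/recovery row."""
    fields = line.split()
    return {
        "stage": fields[4],
        "source_node": fields[6],
        "target_node": fields[8],
        "bytes_recovered": int(fields[16]),
        "bytes_total": int(fields[18]),
    }


shard_source, shard_target = parse_relocating_shard(SHARDS_LINE)
recovery = parse_recovery(RECOVERY_LINE)

print("relocation source per _cat/shards:  ", shard_source)
print("recovery source per _cat/recovery:  ", recovery["source_node"])
print("same target node:", shard_target == recovery["target_node"])
print("sources differ:  ", shard_source != recovery["source_node"])
print("bytes recovered: ", recovery["bytes_recovered"], "of", recovery["bytes_total"])
```

Both outputs agree on the target (host20-3) but name different source nodes, and the recovery row shows no bytes transferred.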
Checking the tasks on host8-0 and host10-1, I can see that host10-1 has a recovery task.
After restarting host10-1, the relocation of shard 0 started to make progress.
The question is: why did host10-1 fail to recover shard 0 on the initial attempt, and how can we avoid this?