A sample of the response from the recovery API:
"translog" : {
"recovered" : 17201,
"total" : 4686825,
"percent" : "0.4%",
"total_on_start" : -1,
"total_time_in_millis" : 1790702
}
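For context, that is coming from the index recovery API; the request is nothing special, just something along these lines (index name is a placeholder):

GET /my-index/_recovery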
This is on an index that is no longer being written to. The size is well below the cap, but it has taken 30 minutes to get 0.4% of the way through. At that rate, letting it complete would take far too long (although, admittedly, these recoveries seem to eventually go away on their own long before they would if that pace were maintained).
The shards for this index are all around 6GB, and when I get shard stats for the index, I see:
"translog" : {
"operations" : 0,
"size_in_bytes" : 55,
"uncommitted_operations" : 0,
"uncommitted_size_in_bytes" : 55,
"earliest_last_modified_age" : 4316474
}
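That translog section is from the index stats API, restricted to the translog metric and broken out per shard; roughly this, again with a placeholder index name:

GET /my-index/_stats/translog?level=shards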
Nothing about this index is special (ever since moving to 7.5, I have seen this sort of translog slowness on many different indices), and I have not changed any translog settings away from their defaults.
To get rid of these parasitic translog operations, I have been reducing the replica count on the index to 0 (I thought reducing from 2 to 1 would work, but it always seems to keep the replica that is being recovered from the translog), allowing the recovery operations to be abandoned, and then increasing the replica count back to its original value. At that point, the translog size is reported as 0, and recovery happens very quickly.
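Concretely, the workaround is just a pair of settings updates (sketched here assuming the index normally runs with 2 replicas):

PUT /my-index/_settings
{
  "index" : { "number_of_replicas" : 0 }
}

...wait for the stuck recoveries to disappear, then...

PUT /my-index/_settings
{
  "index" : { "number_of_replicas" : 2 }
}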
From my perspective, we would be better off if the translog did not exist and we instead just recovered by copying the raw data. Translog recovery has always been somewhat slow for us, but never this bad. Is there something about the new "soft deletes" (our old cluster was on ES 6.4) that I am not understanding that could be causing this? Assuming that might be to blame, I am tempted to either disable them or dramatically reduce the lease period.
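If the retention leases do turn out to be the problem, the change I have in mind is lowering index.soft_deletes.retention_lease.period (the default is 12h, if I am reading the docs right) on the affected indices, something like:

PUT /my-index/_settings
{
  "index.soft_deletes.retention_lease.period" : "1h"
}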