Recovery of closed indices

General: mixed fleet of 7.4/7.9 instances running on Kubernetes. Mostly log ingest. Every cluster has some newer open indices and some older closed indices.

Scenario: a data node leaves the cluster. The node_left delayed timeout (index.unassigned.node_left.delayed_timeout) has been increased, so recovery is deferred. The node rejoins the cluster. Recovery begins. We observe that shards for closed indices are slow to recover.
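
For reference, the deferral mentioned above is controlled by the index-level setting index.unassigned.node_left.delayed_timeout. Below is a minimal sketch of the settings request; the "10m" value, the "_all" pattern, and the helper name are illustrative assumptions, not the values used in our clusters. Which indices the pattern actually matches (for example, whether closed indices are included) depends on request parameters not shown here.

```python
import json


def delayed_timeout_request(timeout="10m", index_pattern="_all"):
    """Build (method, path, body) for raising the node_left delay.

    Hypothetical helper: the timeout value and index pattern are
    placeholders chosen for illustration only.
    """
    body = {
        "settings": {
            "index.unassigned.node_left.delayed_timeout": timeout,
        }
    }
    return "PUT", f"/{index_pattern}/_settings", json.dumps(body)


method, path, body = delayed_timeout_request()
print(method, path, body)
```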

Hypothesis: counterintuitively, shards for closed indices are always recovered by copying the primary, despite the fact that the shards have actually been in sync for a long time.

Is this hypothesis correct for 7.4 - 7.9? If not, is there another hypothesis to account for the observably slow recovery of closed indices in our clusters?

Welcome to our community! :smiley:

Are the 7.4 and 7.9 nodes in the same cluster, or in different ones? I think it's the latter, but I just want to make sure.

Different ones. Sorry for the confusion. Summary: if the answer is version-dependent, some of our clusters are on 7.4, and some on 7.9.

The answer is probably version-dependent: there have been changes in how shard recovery works between those two versions. 7.4 is now EOL, so it'd be good to focus on the 7.9 case.

The hypothesis is technically true as written, but misleading. All shards are recovered by copying data from the primary. But shards that are in sync should not take long to recover after a restart, whether they're closed or not, because there's almost nothing to copy.

Can you share more precise evidence of what you're seeing? What does GET _recovery report while the recoveries are ongoing? Can you reproduce this with logger.org.elasticsearch.indices.recovery: TRACE so we can see a bit more about the path Elasticsearch is taking?
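
To make that evidence easier to read, here is a rough sketch of how the parsed JSON from GET _recovery could be flattened into one row per shard copy. The function name and the sample values are hypothetical; the field names follow the recovery API response as I understand it, so check them against your actual output. A reused fraction near 1.0 means almost nothing was copied, while a value near 0.0 means the whole shard was rebuilt from the primary.

```python
# Body for PUT _cluster/settings to enable the TRACE logger mentioned above.
TRACE_LOGGER_BODY = {
    "transient": {"logger.org.elasticsearch.indices.recovery": "TRACE"}
}


def summarize_recovery(recovery):
    """Flatten a parsed GET _recovery response into one dict per shard copy."""
    rows = []
    for index_name, data in recovery.items():
        for shard in data.get("shards", []):
            size = shard.get("index", {}).get("size", {})
            total = size.get("total_in_bytes", 0)
            reused = size.get("reused_in_bytes", 0)
            rows.append({
                "index": index_name,
                "shard": shard.get("id"),
                "type": shard.get("type"),
                "stage": shard.get("stage"),
                "seconds": shard.get("total_time_in_millis", 0) / 1000,
                # Fraction of bytes reused from local disk; None if size unknown.
                "reused_fraction": reused / total if total else None,
            })
    return rows


# Hypothetical sample shaped like the API response; the values are made up.
sample = {
    "logs-closed-example": {
        "shards": [{
            "id": 0,
            "type": "PEER",
            "stage": "DONE",
            "total_time_in_millis": 480000,
            "index": {"size": {"total_in_bytes": 20_000_000_000,
                               "reused_in_bytes": 0}},
        }]
    }
}

rows = summarize_recovery(sample)
for row in rows:
    print(row)
```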

Thank you for responding. I understand your point about recovery always copying from the primary, and I will focus on 7.9. Before I upload data, I'd like to make terms like "slow" and "pretty fast" more objective.

On our largest 7.9.1 system, two indices are created each day. Each index ingests about 10GB/hr, 240GB/day; there is 1 replica, so 480GB storage/index/day; 2 indices/day, so pushing 1TB storage/day. This goes into 40-50 shards/day (we've been tuning) so about 20GB/shard. We keep indices open for about 14 days, close them, and then keep the closed indices online for about 2 more weeks.

There are about 50 data nodes. With about 50 shards/day, 50 data nodes, and 30 days retention, every data node that leaves takes out about 30 shards, about half assigned to open indices on average and the other half to closed.
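
The sizing arithmetic above can be sketched as follows. All inputs are the approximate figures already quoted; the one assumption is that the 40-50 shards/day figure counts shard copies (primaries plus replicas), which is what makes the ~20GB/shard number work out.

```python
# Approximate figures from the description above.
gb_per_hour = 10
replicas = 1
indices_per_day = 2
shards_per_day = 50      # upper end of the 40-50 range; assumed to include replicas
data_nodes = 50
retention_days = 30      # ~14 days open plus ~2 weeks closed

primary_gb_per_index_per_day = gb_per_hour * 24
storage_gb_per_index_per_day = primary_gb_per_index_per_day * (1 + replicas)
storage_gb_per_day = storage_gb_per_index_per_day * indices_per_day
gb_per_shard = storage_gb_per_day / shards_per_day
shards_per_node = shards_per_day * retention_days / data_nodes

print(f"storage per day: {storage_gb_per_day} GB")
print(f"size per shard:  {gb_per_shard:.1f} GB")
print(f"shards per node: {shards_per_node:.0f}")
```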

The ES data nodes are Kubernetes pods. They leave the cluster periodically because we drain their K8s worker node for operational reasons. In a recent example, an ES data node (pod) leaving resulted in about 15 yellow open indices and 14 yellow closed indices. The data pod rescheduled, reconnected to its storage, and rejoined the ES cluster before the node_left timeout expired.

It took 2+ hours to recover the open indices. It then took 2 hours to recover the 14 yellow closed indices, an average of about 8 minutes each. Some closed indices recovered in less than 1 minute, while others took 10-15 minutes.

Is 8 minutes/index for closed indices what you would call "pretty fast" recovery?

No. That's long enough to rebuild a whole 20GB shard from scratch. A trivial recovery of a shard typically takes a few seconds at worst.
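
As a rough back-of-the-envelope check of that comparison: the sketch below assumes the recovery throttle (indices.recovery.max_bytes_per_sec) is at its 7.x default of 40MB/s, which has not been confirmed for this cluster, and that each affected index has one ~20GB shard copy on the node. Under those assumptions, a full copy of one shard takes about as long as the observed average per closed index.

```python
# Observed figures from the thread.
closed_indices = 14
closed_recovery_minutes = 120
shard_gb = 20

# Assumption: default recovery throttle; not confirmed for this cluster.
assumed_throttle_mb_per_sec = 40

avg_minutes_per_index = closed_recovery_minutes / closed_indices
full_copy_minutes = shard_gb * 1000 / assumed_throttle_mb_per_sec / 60

print(f"observed average per closed index: {avg_minutes_per_index:.1f} min")
print(f"full copy of one shard at throttle: {full_copy_minutes:.1f} min")
```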

OK. I have sanitized out our internal acronyms and IPs from the JSON returned by the Recovery API on one of these slow-to-recover closed indices, but I don't see any way to attach it. Should I dump the 38K file here in the post or is there a better way? I tried your upload button, but it seems to only want images.

Use a gist, pastebin, or similar.

Thanks for your help on this. Can't proceed to put this stuff up. I'm sorry, didn't mean to drag you into something I couldn't finish but ... I can't.