Mass allocate_stale_primary

We have a pretty sizable Elasticsearch cluster (currently storing 1.2 PB) that has had some data loss due to some PEBCAK. This is for a logging/debugging use case, so we can tolerate data loss; it's a minor inconvenience, but not a big deal.

I could delete all indices with unassigned shards, but I would rather just allocate_stale_primary or allocate_empty_primary en masse to try to save as much data as possible.

My issue is that both of these APIs require a node parameter, which I don't know how to specify. Is there a way to have the cluster automatically determine which node to relocate on?

Alternatively, is there a way to ask the cluster which node would be a good candidate for a shard of an index, which I can then use as the argument to the node parameter?

Is there a better way? Is there a "yeah, screw the unassigned shards, I'm okay with data loss, just make it green again" button I can easily reach for?

There is not, and that is deliberate. You will need to implement this yourself, and then ideally throw it away once done to avoid the temptation to use it again :slight_smile:

For allocate_empty_primary it doesn't really matter which node you use, since you're allocating an empty primary. If there are a lot of them, it's probably best to spread them out a bit, but if you don't, they will be rebalanced eventually.
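
For concreteness, a single empty-primary allocation is just one command to the reroute API. Here's a minimal sketch using Python and requests (the cluster address, index, shard number and node name are all placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

body = {
    "commands": [
        {
            "allocate_empty_primary": {
                "index": "logs-2018.12.01",  # hypothetical index with an unassigned primary
                "shard": 3,                  # shard number
                "node": "data-node-7",       # any data node will do; the shard starts out empty
                "accept_data_loss": True,    # required: you're explicitly accepting the loss
            }
        }
    ]
}

resp = requests.post(f"{ES}/_cluster/reroute", json=body)
resp.raise_for_status()
```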

For allocate_stale_primary you need to find a stale copy of the shard first. GET _shard_stores or the cluster allocation explain API can do this. If you get it wrong and try to allocate a stale primary on a node that doesn't have a copy of the shard then, in 6.5.1 at least, it appears to succeed (returning 200 OK) but in fact fails a bit later in the process. I opened #37098 to investigate this further.
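
To show what I mean by finding a copy, here's a rough sketch that lists which nodes still hold a store for each shard of one index via GET _shard_stores (the cluster address and index name are placeholders, and the parsing assumes the 6.x response shape, where each store entry mixes node info keyed by node id with allocation metadata):

```python
import requests

ES = "http://localhost:9200"   # placeholder cluster address
index = "logs-2018.12.01"      # hypothetical index name

# status=all also returns stores for shards with no currently assigned copy
resp = requests.get(f"{ES}/{index}/_shard_stores", params={"status": "all"})
resp.raise_for_status()

for shard_id, info in resp.json()["indices"][index]["shards"].items():
    for store in info["stores"]:
        for key, value in store.items():
            # the node-id entry is a dict with the node's name and address;
            # the other keys are allocation metadata ("allocation", "allocation_id", ...)
            if isinstance(value, dict) and "name" in value:
                print(shard_id, value["name"], store.get("allocation"))
```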

Unfortunately, our servers are managed by an internal system that's really aggressive about bringing down "failed" hosts (using a very loose definition of "failed"). We're at 1 replica per primary right now, which is almost good enough to compensate and consistently recover from failed servers, but not quite. Our use case isn't important enough to justify the cost of jumping to 2 replicas per primary.

It's like 800 shards. Is that still okay?

Thanks for investigating this!

So I essentially need to do a join: use the REST API to fetch all unallocated shards (already done), join them against the _shard_stores REST API to get the correct nodes, and use the result to call the reroute API. Is that correct?
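
Something like this rough sketch is what I have in mind (the endpoints are real, but the cluster address is a placeholder, and the error handling, batching and choice of store are deliberately naive):

```python
import requests

ES = "http://localhost:9200"  # placeholder cluster address

# 1. All unassigned primaries, via the cat shards API in JSON form.
shards = requests.get(f"{ES}/_cat/shards", params={"format": "json"}).json()
unassigned = [s for s in shards if s["state"] == "UNASSIGNED" and s["prirep"] == "p"]

commands = []
for s in unassigned:
    index, shard = s["index"], int(s["shard"])

    # 2. Join against _shard_stores to find a node that still has a copy on disk.
    stores = (
        requests.get(f"{ES}/{index}/_shard_stores", params={"status": "all"})
        .json()["indices"]
        .get(index, {}).get("shards", {}).get(str(shard), {}).get("stores", [])
    )

    node_name = None
    for store in stores:
        for value in store.values():
            if isinstance(value, dict) and "name" in value:  # the node-id entry
                node_name = value["name"]
                break
        if node_name:
            break

    if node_name:
        # A stale copy exists on that node: ask for the primary to be allocated there.
        commands.append({"allocate_stale_primary": {
            "index": index, "shard": shard,
            "node": node_name, "accept_data_loss": True,
        }})
    # else: no copy anywhere, so leave it for a later allocate_empty_primary pass

# 3. One reroute call carrying all the allocate_stale_primary commands.
if commands:
    resp = requests.post(f"{ES}/_cluster/reroute", json={"commands": commands})
    resp.raise_for_status()
```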

Losing 800 shards' worth of data is stretching the bounds of what I'd call "okay", but it should be possible to allocate them all as empties if that's what you want. Spread them out so there's not too much rebalancing to do afterwards.

Sounds about right, yes. I think I'd do this first and only allocate the empty ones after everything has settled down and you've confirmed there are no more stale ones to go. If you allocate an empty primary then it'll wipe out any existing stale stores.

Heh, it's a petabyte-scale cluster. That's only about 2% of the shards by count, and even less by actual data size.

That's the hard part, IDK how to do that in a decent way :confused:

Yep, that's exactly what I was thinking. Excellent. Thanks!
