We have a pretty sizable Elasticsearch cluster (currently storing 1.2 PB) that has had some data loss due to PEBCAK. This is for a logging/debugging use case, so we can tolerate data loss; it's a minor inconvenience, but not a big deal.
I could delete all indices with unassigned shards, but I would rather run allocate_stale_primary or allocate_empty_primary en masse to try to save as much data as possible.
My issue is that both of these APIs require a node parameter, which I don't know how to specify. Is there a way to have the cluster automatically determine which node to allocate on?
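
For concreteness, this is roughly the shape of the call I mean (sketched with Python's requests library; the index and shard values are just placeholders):

```python
import requests

# Hypothetical example only: index/shard are placeholders.
# The sticking point is the "node" field, which the reroute command requires.
resp = requests.post(
    "http://localhost:9200/_cluster/reroute",
    json={
        "commands": [
            {
                "allocate_stale_primary": {
                    "index": "logs-2019.01.01",  # placeholder index
                    "shard": 0,                  # placeholder shard number
                    "node": "???",               # <- the part I don't know how to fill in
                    "accept_data_loss": True,
                }
            }
        ]
    },
)
print(resp.json())
```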
Alternatively, is there a way to ask the cluster which node would be a good candidate for a shard of an index, which I can then use as the argument to the node parameter?
Is there a better way? Is there a "yeah, screw the unassigned shards, I'm okay with data loss, just make it green again" button I can easily reach for?
There is not, and that is deliberate. You will need to implement this yourself, and then ideally throw it away once done to avoid the temptation to use it again.
For allocate_empty_primary it doesn't really matter which node you use, since you're allocating an empty primary. If there are a lot of them, it's probably best to spread them out a bit, but if you don't they will be rebalanced eventually.
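
If it helps, here's a minimal sketch of spreading them out, assuming a local cluster and the requests library (endpoint and field names as in 6.x; double-check against your version):

```python
import itertools
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Cycle through the data nodes so the empty primaries end up spread out.
nodes = requests.get(f"{ES}/_nodes/data:true").json()["nodes"]
node_names = itertools.cycle(n["name"] for n in nodes.values())

def allocate_empty(index, shard):
    """Allocate an empty primary for one unassigned shard on the next data node."""
    cmd = {
        "allocate_empty_primary": {
            "index": index,
            "shard": shard,
            "node": next(node_names),
            "accept_data_loss": True,  # required: this discards whatever the shard held
        }
    }
    return requests.post(f"{ES}/_cluster/reroute", json={"commands": [cmd]}).json()
```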
For allocate_stale_primary you need to find a stale copy of the shard first. GET _shard_stores or the cluster allocation explain API can do this. If you get it wrong and try to allocate a stale primary on a node that doesn't have a copy of the shard then, in 6.5.1 at least, it appears to succeed (returning 200 OK) but in fact fails a bit later on in the process. I opened #37098 to investigate this further.
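
As a sketch, finding a node that still has a copy might look something like this (the _shard_stores response nests node info under node IDs; the exact shape can vary by version, so treat this as illustrative):

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

def find_stale_copy(index, shard):
    """Return the name of a node that still holds a copy of this shard, or None."""
    resp = requests.get(f"{ES}/{index}/_shard_stores", params={"status": "all"}).json()
    shards = resp.get("indices", {}).get(index, {}).get("shards", {})
    for store in shards.get(str(shard), {}).get("stores", []):
        # Each store entry mixes node info (keyed by node ID) with allocation metadata.
        for value in store.values():
            if isinstance(value, dict) and "name" in value:
                return value["name"]
    return None

# The result goes straight into the "node" field of an allocate_stale_primary command.
```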
Unfortunately, our servers are managed by an internal system that's really aggressive about bringing down "failed" hosts (using a very loose definition of "failed"). We're at 1 replica per primary right now, which is almost enough to consistently recover from failed servers, but not quite. Our use case isn't important enough to justify the cost of jumping to 2 replicas per primary.
It's like 800 shards. Is that still okay?
Thanks for investigating this!
So I essentially need to do a join: use the REST API to fetch all unassigned shards (already done), join them against the _shard_stores REST API to get the correct nodes, and use the result to call the reroute API. Is that correct?
Losing 800 shards' worth of data is stretching the bounds of what I'd call "okay", but it should be possible to allocate them all as empties if that's what you want. Spread them out so there's not too much rebalancing to do afterwards.
Sounds about right, yes. I think I'd do this first and only allocate the empty ones after everything has settled down and you've confirmed there are no more stale ones to go. If you allocate an empty primary then it'll wipe out any existing stale stores.
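
Roughly, putting it together (reusing find_stale_copy and allocate_empty from the sketches above; again, check the field names against your version):

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# 1. List unassigned primaries via the _cat API.
rows = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state"},
).json()
unassigned = [(r["index"], int(r["shard"]))
              for r in rows
              if r["prirep"] == "p" and r["state"] == "UNASSIGNED"]

# 2. Allocate a stale primary wherever a surviving copy exists (find_stale_copy is
#    from the earlier sketch); keep the rest aside for empty allocation later.
no_copy = []
for index, shard in unassigned:
    node = find_stale_copy(index, shard)
    if node is None:
        no_copy.append((index, shard))
        continue
    cmd = {"allocate_stale_primary": {
        "index": index, "shard": shard, "node": node, "accept_data_loss": True}}
    requests.post(f"{ES}/_cluster/reroute", json={"commands": [cmd]})

# 3. Only after those have recovered and nothing else is salvageable, run
#    allocate_empty (earlier sketch) over whatever is left in no_copy.
print(f"stale-allocated {len(unassigned) - len(no_copy)}, {len(no_copy)} left for empty allocation")
```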