If Primary and Replica shards both fail how to recover?

jwlee · February 15, 2019, 7:03pm

I tried to find a relevant document but I could not get the answer.

Let's say there is a primary shard and a replica shard. I believe that if the primary shard goes down, the replica shard is promoted to primary and a new replica is created. On the other hand, if the replica shard goes down, a new replica is simply recreated from the primary.

The questions I have are:

What happens when both the primary and replica shards go down? Does it impact other shards as well?
Is there a way to fully recover when both the primary and replica shards go down? If there is a recovery action, what happens to new inserts or updates to documents that belong to these shards during the recovery? I believe we can schedule a daily snapshot of the database, but this means we would lose any new data written to these shards after the snapshot was taken.

In practice, even though the chance of both the primary and replica shards failing is small, I believe we should still keep more than one replica and distribute them across different data centers if possible. However, for the worst-case scenario, I am really curious how we can recover from the failure of all the primary and replica shards.

Any input is really appreciated. Thanks.

DavidTurner · February 15, 2019, 7:18pm

It depends on what you mean by "go down". If a primary shard fails then you are correct that the master promotes one of the active in-sync replicas to become the new primary. If there are currently no active in-sync replicas then it waits until one appears, and your cluster health reports as RED. So if "go down" means "... and come back up again" then everything's fine. However if all of the in-sync copies of your data permanently disappear then there's not a lot that can be done: you have by definition lost data.
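As a rough illustration of spotting the RED state described above, here is a minimal Python sketch. It assumes an unsecured node at `http://localhost:9200` (a placeholder) and relies on the `_cluster/health` API with `level=indices` and on the request body of the `_cluster/allocation/explain` API; the helper names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without authentication


def fetch_health(base_url=ES_URL):
    """Call GET _cluster/health?level=indices and return the parsed JSON."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health?level=indices") as resp:
        return json.load(resp)


def red_indices(health):
    """Return the names of indices reported as red (a primary is unassigned)."""
    indices = health.get("indices", {})
    return sorted(name for name, info in indices.items() if info.get("status") == "red")


def explain_body(index, shard):
    """Body for GET _cluster/allocation/explain, asking about one primary shard."""
    return {"index": index, "shard": shard, "primary": True}
```

Calling `red_indices(fetch_health())` against a live cluster lists the indices that are waiting for an in-sync copy to reappear, and `explain_body` can be sent to the allocation explain API to ask why a particular primary is unassigned.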

It's not recommended to split a single Elasticsearch cluster across data centres. It expects the node-to-node connections to have low latency.

jwlee · February 15, 2019, 8:56pm

Thanks for the inputs.

Not sure if everything is fine. Wouldn't new requests be lost while it waits for an active in-sync replica to appear?

What happens to new insert or update requests to the shards while the cluster health is RED?

DavidTurner · February 15, 2019, 10:25pm

Not lost, no; such requests would be rejected.
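Because a rejected write is reported back to the client, the client can keep the document and try again later. A minimal Python sketch of that idea follows. The `send` callable is hypothetical (it stands for whatever performs the index request and returns the HTTP status code), and treating 429 and 503 as retryable is an assumption rather than a documented contract.

```python
import time

RETRYABLE = {429, 503}  # assumption: statuses worth retrying while a shard is unavailable


def write_with_retry(send, doc, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a rejected write with exponential backoff.

    Returns True once the write is accepted, False if every attempt was rejected.
    """
    for attempt in range(attempts):
        status = send(doc)
        if status < 300:
            return True
        if status not in RETRYABLE:
            raise RuntimeError(f"write failed with non-retryable status {status}")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    return False
```

If the function returns `False`, the caller still holds `doc` and can queue it somewhere durable until the shard is back.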

jwlee · February 20, 2019, 7:24pm

So if the primary shard fails and there are no active in-sync replicas, then the data has been lost. Correct?

If so, is there a way we can mitigate this risk proactively, such as taking a daily database snapshot as a backup?

DavidTurner · February 20, 2019, 9:45pm

The best thing to do is to make sure that you do have active in-sync replicas, by keeping your cluster health at GREEN. That way your data is safe unless you suffer from multiple failures involving independent devices all at the same time.
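To make that concrete, here is a small Python sketch of the two requests involved: raising the replica count through the index `_settings` API and blocking until the cluster reports GREEN through `_cluster/health` with `wait_for_status`. It only builds the method, path and body; the index name and the helper names are hypothetical, and sending the requests is left to whichever HTTP client you use.

```python
import json
from urllib.parse import urlencode


def replica_settings_request(index, replicas):
    """Return (method, path, body) that sets the number of replicas for an index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return "PUT", f"/{index}/_settings", body


def wait_for_green_path(timeout="30s"):
    """Path for a health call that waits for GREEN, up to the given timeout."""
    query = urlencode({"wait_for_status": "green", "timeout": timeout})
    return "/_cluster/health?" + query
```

For example, `replica_settings_request("my-index", 2)` asks for two replicas of every primary in the hypothetical index `my-index`, after which a request to `wait_for_green_path()` shows whether the new copies were actually allocated within the timeout.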

Sure, you can take snapshots too. Indeed, many people take snapshots much more frequently than daily, which works because they are incremental: a snapshot every 30 minutes is not unusual.

Restoring from a snapshot still means you lose any data written after the snapshot was taken, but maybe this is ok in your case?
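A short Python sketch of that trade-off: it builds the paths for the snapshot and restore APIs and computes how much recent data a restore would miss. The repository name `my_backup` and the `snap-YYYYMMDD-HHMM` naming scheme are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

REPO = "my_backup"  # hypothetical snapshot repository name


def snapshot_path(now, repo=REPO):
    """Path for PUT /_snapshot/<repo>/<name>, with a timestamped lowercase name."""
    return f"/_snapshot/{repo}/" + now.strftime("snap-%Y%m%d-%H%M")


def restore_path(snapshot):
    """Path for POST /_snapshot/<repo>/<name>/_restore."""
    return snapshot + "/_restore"


def data_at_risk(last_snapshot, failure_time):
    """Window of writes that a restore from last_snapshot would not contain."""
    return failure_time - last_snapshot


taken = datetime(2019, 2, 20, 21, 30, tzinfo=timezone.utc)
failed = taken + timedelta(minutes=25)
print(snapshot_path(taken))         # /_snapshot/my_backup/snap-20190220-2130
print(data_at_risk(taken, failed))  # 0:25:00
```

With a 30-minute schedule the window returned by `data_at_risk` never exceeds roughly half an hour, whereas a daily snapshot can leave up to a day of writes unrecoverable.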
