I tried to find a relevant document but I could not get the answer.
Let's say there are primary shard and replica shard. I believe if the primary shard goes down, the replica shard promoted to the primary shard and recreate the replica. On the other hand, if replica shard goes down, it will simply recreate the replica based off from the primary.
The question that I have is:
What happens when primary and replica shards go down? Do it impact other shards as well?
Is there a way that can fully recover when both primary and replica shards go down? If there is a recovery action, what happens when there are new inserts or updates to the documents that belong to these shards during the recovery? I believe we can schedule a daily snapshot of the database but this means it will lose any new data that are coming to these shards after the snapshot was taken.
In practical even though the chances of both primary and replica shard failure are rare I believe we should still take more than one replica and distribute to different data center if possible. However, in case of the worst scenario, I am really curious how we can recover from all the primary and replica shards failure.
It depends on what you mean by "go down". If a primary shard fails then you are correct that the master promotes one of the active in-sync replicas to become the new primary. If there are currently no active in-sync replicas then it waits until one appears, and your cluster health reports as RED. So if "go down" means "... and come back up again" then everything's fine. However if all of the in-sync copies of your data permanently disappear then there's not a lot that can be done: you have by definition lost data.
It's not recommended to split a single Elasticsearch cluster across data centres. It expects the node-to-node connections to have low latency.
The best thing to do is to make sure that you do have active in-sync replicas, by keeping your cluster health at GREEN. That way your data is safe unless you suffer from multiple failures involving independent devices all at the same time.
Sure, you can do this too. Indeed many people take snapshots much more frequently than daily which works because they are incremental - a snapshot every 30 minutes is not unusual.
Restoring from a snapshot still means you lose any data written after the snapshot was taken, but maybe this is ok in your case?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.