I've seen lots of people talk about wanting primary shards evenly distributed, or kept off certain nodes, and the general answer given is "why? It doesn't matter, the replica does the same work as the primary".
But I'm curious about the effect of a primary shard going down versus a replica. What happens when a primary shard goes down? What happens to writes that arrive while the cluster is reallocating, and how intensive is the reallocation process?
When a primary goes down, the master node automatically promotes one of the replicas to primary.
The master then allocates a new replica elsewhere in the cluster, and the data is copied to it over the wire.
Write operations are still possible during this time because the promoted replica now serves as the primary (the index is in yellow state until the new replica is in place).
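To make the sequence concrete, here is a toy model of it in Python. This is only a sketch of the promote-then-reallocate logic described above, not Elasticsearch's actual allocation code: the `ShardGroup` class, the function names, and the node names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ShardGroup:
    """Hypothetical model of one shard: where its primary and replicas live."""
    primary: Optional[str]
    replicas: List[str] = field(default_factory=list)
    desired_replicas: int = 1


def handle_node_failure(group: ShardGroup, failed_node: str) -> None:
    """Drop copies on the failed node; promote a surviving replica if the primary was lost."""
    group.replicas = [n for n in group.replicas if n != failed_node]
    if group.primary == failed_node:
        # Promotion: an existing replica takes over, so no data is copied at this step.
        group.primary = group.replicas.pop(0) if group.replicas else None


def allocate_replica(group: ShardGroup, node: str) -> None:
    """Place a new replica on a node that does not already hold a copy."""
    if group.primary is not None and node != group.primary and node not in group.replicas:
        group.replicas.append(node)  # in a real cluster, this is where data is copied


def health(group: ShardGroup) -> str:
    """green: all copies assigned; yellow: primary only; red: no primary."""
    if group.primary is None:
        return "red"
    if len(group.replicas) < group.desired_replicas:
        return "yellow"
    return "green"


shard = ShardGroup(primary="node-a", replicas=["node-b"])
print(health(shard))                   # green
handle_node_failure(shard, "node-a")
print(shard.primary, health(shard))    # node-b yellow
allocate_replica(shard, "node-c")
print(health(shard))                   # green
```

The point of the model is that promotion itself is cheap because the replica already holds the data; the expensive part is rebuilding the missing replica afterwards.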
But I'm still not 100% clear. I have a working cluster with a primary and a replica shard. When the primary goes down, how does the master know? I assume there is some delay before that node is marked as "down" and the replica is promoted, no? Otherwise a brief network pause would cause failovers. How long is that delay, and what happens during that window?