Swapping primary and replica shard allocations in two-node clusters

Suppose there is a two-node cluster whose indices each have 1 replica. The initial shard allocation distributes primaries and replicas across both nodes:

node-1: [p0, p2, p4, p6, p8, r1, r3, r5, r7, r9]
node-2: [p1, p3, p5, p7, p9, r0, r2, r4, r6, r8]

If one node goes down, all primaries end up allocated on the remaining node (i.e. the replicas there get promoted):

node-1: [p0, p2, p4, p6, p8, p1, p3, p5, p7, p9]

When node-2 finally comes back up, it ends up with all the replica shards, while all the primaries stay on node-1:

node-1: [p0, p2, p4, p6, p8, p1, p3, p5, p7, p9]
node-2: [r0, r2, r4, r6, r8, r1, r3, r5, r7, r9]

Is there any way to bring this cluster back to having both primaries and replicas across both nodes without adding a third node or temporarily reducing replica count?

We don't want all primaries on one node, since primaries exert more stress on their node in update scenarios, which we commonly run (https://github.com/elastic/elasticsearch/issues/41543).

To add some context -- is there a way to force-unallocate a replica, move its primary to node-2, and then allocate the replica on node-1?

Is this similar (in terms of cluster overhead) to reducing the replica count, letting the primaries rebalance, and then increasing the replica count back on an index?

If you manually cancel the allocation of the primary then the replica will be promoted, and the old primary will then be reassigned as a replica and brought back into sync. This isn't a great solution, because the shard lacks redundancy while the old primary is being rebuilt as a replica.

There is no mechanism for "demoting" a primary shard back to a replica without needing to restart it.

Thanks @DavidTurner. I understand the redundancy risk. Does this have less overhead than reducing the replica count and then increasing it? (I guess reducing replicas and bringing them back would rebalance the primaries by moving all their data, and then restoring the replicas would again be a full peer recovery?)

If you manually cancel the allocation of the primary

Assuming you are referring to CancelAllocationCommand, right?
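For reference, the reroute call I mean would look something like this sketch (the index name, shard number, and node name here are placeholders for whatever is actually on the cluster; as I understand it, `allow_primary` must be set to cancel a primary's allocation):

```
POST /_cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "my-index",
        "shard": 0,
        "node": "node-1",
        "allow_primary": true
      }
    }
  ]
}
```

This would need to be issued per shard whose primary we want moved off node-1.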

Yes, I think so. If you reduce the replica count then it's likely that the data for the unneeded replicas will actually be deleted from disk, so Elasticsearch will need to completely rebuild those shard copies to rebalance the cluster (i.e. copy each whole shard over the network), and then completely rebuild every shard copy again when you increase the replica count at the end of the process.
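For concreteness, the replica-count round trip would be something like the following sketch (`my-index` is a placeholder), and both steps trigger full shard copies over the network:

```
# Drop the replicas -- the replica data is likely deleted from disk
PUT /my-index/_settings
{ "index": { "number_of_replicas": 0 } }

# Restore them later -- every replica is rebuilt via a full peer recovery
PUT /my-index/_settings
{ "index": { "number_of_replicas": 1 } }
```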

OTOH, if you cancel the primary's allocation then nothing is deleted, and that copy can often be rebuilt as a replica by copying over only the missing operations.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.