How does elasticsearch move a primary shard?


I wonder what the exact process of moving a primary shard to another node is. I'm not interested in how to do this with the HTTP API, but in what happens when the user or Elasticsearch decides to do it.
How do the recovery and handover happen? Can the shard accept and perform all operations during the process, and how are those operations kept from getting lost or being directed to the wrong primary?
Is it documented somewhere in detail?
If not, could somebody with enough knowledge write it up?



A primary shard does not really move to another node per se.
If you have a primary shard on node1 and its replica on node2, and node1 stops, the replica on node2 is automatically promoted to primary.

In a sense it moved to node2 but not really physically I'd say.

If you restart node1, then the shard (which was primary) is started as a replica and recovers from the new primary (from node2).

Makes sense?

I can move a primary shard to another node with _cluster/reroute.
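For reference, a minimal sketch of the request body such a manual move sends to POST /_cluster/reroute. The helper name build_move_command, the index name my-index and the node names are only placeholders for illustration; the sketch builds the body and does not talk to a cluster.

```python
import json


def build_move_command(index, shard, from_node, to_node):
    """Build a reroute body containing a single `move` command.

    The helper and its argument values are hypothetical; only the
    shape of the body follows the cluster reroute API.
    """
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Placeholder names: move shard 0 of "my-index" from node1 to node2.
body = build_move_command("my-index", 0, "node1", "node2")
print(json.dumps(body, indent=2))
```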
What I would like to know:

  • how exactly does the handover happen?
  • what happens to queries during the process (recovering the shard can take a long time, and during that time new data arrives at the primary and needs to be synced, etc.)?

It would be nice to have this explained in detail in order to understand its consequences.



The main question is why would you want to do that...

Anyway here are some high level explanations.
It creates the shard on the other node (let's say node2) and copies the segments from node1 to node2. It then applies the missing operations from the transaction log.
Once both shards are in sync, it promotes the shard on node2 to primary and removes the copy on node1.
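The steps above can be sketched as a toy model. This is not Elasticsearch code and it glosses over everything hard (concurrency, failures, how in-flight requests are handled at the moment of handover); the ShardCopy class, the relocate_primary function and the dict standing in for segment files are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    node: str
    docs: dict = field(default_factory=dict)  # stands in for segment files
    is_primary: bool = False


def relocate_primary(source, target_node, translog, ops_during_copy):
    """Toy relocation: copy data, replay missed operations, hand over."""
    # 1. Create the shard on the target node and copy the current data.
    target = ShardCopy(node=target_node, docs=dict(source.docs))
    copied_up_to = len(translog)

    # 2. Writes keep arriving at the source while the copy is running.
    for doc_id, value in ops_during_copy:
        source.docs[doc_id] = value
        translog.append((doc_id, value))

    # 3. Replay the operations the target missed from the transaction log.
    for doc_id, value in translog[copied_up_to:]:
        target.docs[doc_id] = value

    # 4. Both copies are in sync: promote the target, demote the source.
    assert target.docs == source.docs
    source.is_primary = False
    target.is_primary = True
    return target


source = ShardCopy("node1", {"1": "a", "2": "b"}, is_primary=True)
translog = [("1", "a"), ("2", "b")]
new_primary = relocate_primary(source, "node2", translog, [("3", "c"), ("1", "a2")])
print(new_primary.node, new_primary.docs)
```

In this model the writes that arrive during the copy are not lost, because they are recorded in the log and replayed on the target before the promotion.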

It's great that elasticsearch has the ability to move shards manually!
I use this feature to achieve a better shard allocation setup, because the built-in allocation isn't really smart. It will happily put very high traffic shards onto the same nodes while placing low traffic ones on others, resulting in a significant imbalance between the cluster nodes.
Because I know the metrics, I can move them more appropriately.

But my real question is: on a high traffic shard there could be a state where the sync can't be reached. What happens then? Does elasticsearch forcibly move the shard and lose inflight operations?
Or does it pause new operations while a last sync happens?


A replica and a primary are basically getting the same load.

If you want to route shards to some specific servers, I'd not use manual shard allocation but shard allocation filtering.
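As a rough sketch, shard allocation filtering is configured through index settings of the form index.routing.allocation.{include|exclude|require}.{attribute}, sent with PUT /my-index/_settings. The helper name allocation_filter_settings, the index name and the node name below are placeholders; the sketch only builds the settings body.

```python
import json

ALLOWED_KINDS = ("include", "exclude", "require")


def allocation_filter_settings(kind, attribute, values):
    """Build an index settings body for one allocation filter.

    kind      -- "include", "exclude" or "require"
    attribute -- a node attribute such as the built-in "_name"
    values    -- node attribute values, joined with commas
    """
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"kind must be one of {ALLOWED_KINDS}")
    key = f"index.routing.allocation.{kind}.{attribute}"
    return {key: ",".join(values)}


# Placeholder: require all shards of the index to live on node2.
settings = allocation_filter_settings("require", "_name", ["node2"])
print(json.dumps(settings, indent=2))
```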

"But my real question is: on a high traffic shard there could be a state where the sync can't be reached. What happens then? Does elasticsearch forcibly move the shard and lose inflight operations?"

If a shard cannot be reached, then the index operation will probably be rejected.
If the node cannot be reached, then after some time Elasticsearch will decide to reallocate the replicas somewhere else in the cluster.

But maybe I did not understand the question correctly. @Christian_Dahlqvist might know.

Shard allocation filtering isn't a good fit for this. I want to spread indices over all machines, just evenly: if I have a very frequently used index, I want its shards evenly distributed.
Elasticsearch itself doesn't take this into account, which is why I'm doing it manually. But it's working fine...

No, the shard is available. My question is what happens during the handover initiated with a cluster reroute. Will queries get lost? Will elasticsearch reject some of them?
I would like to read about this in a detailed manner if possible, just to understand how this works and what to expect.
It would be good to have a doc page somewhere with this.


I think you misunderstand: the primary doesn't "move" like this. It works by taking an existing replica and promoting it to primary, as mentioned earlier.

Since replicas hold all the same data as primaries anyway, no "last sync" is required. There is a short period of time during which operations might be routed to the old primary after it has been demoted, but there is a mechanism to reroute them to the new primary if this occurs.


You may wish to investigate the index.routing.allocation.total_shards_per_node setting for this. This applies a hard limit to the number of shards per node of an index, which may interfere with the way that shards are allocated and lead to unassigned replicas, but when used sparingly it might help.
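To make the trade-off concrete, here is a small back-of-the-envelope sketch. It only does the arithmetic: the smallest per-node limit that still leaves room for every shard copy of an index, and how many copies cannot be placed under a tighter limit. The function names and the example numbers (6 primaries, 1 replica, 5 data nodes) are made up, and the calculation ignores every other allocation rule, so treat the result as a lower bound rather than a recommendation.

```python
import math


def min_total_shards_per_node(primaries, replicas, data_nodes):
    """Smallest limit that still leaves room for every shard copy."""
    copies = primaries * (1 + replicas)
    return math.ceil(copies / data_nodes)


def unassigned_copies(primaries, replicas, data_nodes, limit):
    """Shard copies that cannot be placed under the given limit."""
    copies = primaries * (1 + replicas)
    return max(0, copies - limit * data_nodes)


# Hypothetical index: 6 primaries, 1 replica each, 5 data nodes.
limit = min_total_shards_per_node(6, 1, 5)
settings = {"index.routing.allocation.total_shards_per_node": limit}
print(settings)
print(unassigned_copies(6, 1, 5, limit=2))
```

With these numbers there are 12 shard copies, so a limit below 3 cannot fit them all on 5 nodes, which is the "unassigned replicas" risk mentioned above.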

You may also be interested in a plan to improve this more generally. Your thoughts (or just a +1) on that issue would be welcome.


You can also have an index with a single shard and allocate that index wherever you want.
