Why do shards seem to get recreated when they already exist?

Elasticsearch 6.8.6.

I'm wondering if someone can explain why, after opening, unfreezing and freezing an index, some replica shards sometimes appear to get recreated from primaries even though they already exist on disk. (Maybe they aren't being recreated, but that's what it seems like.)

Usually when I open an index and watch Shard Activity in Kibana Monitoring I see "Recovery type: Existing Store" for every shard; all the shards are allocated quickly and the health goes Green. But sometimes I open an index and for some shards I see "Recovery type: Peer": "Source / Destination" shows that the shard is being copied from one node to another, elsewhere in Monitoring the shard is shown as Initializing, and the unassigned shard count remains above 0, so cluster health remains Yellow until the recovery has completed. This is irritating because it can take a while with big shards, and because I know those shards exist on disk, so why hasn't Elasticsearch used them?
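In case it helps anyone reproduce this, the recovery type can also be seen outside Kibana with the cat recovery API (the index name here is just a placeholder for one of mine):

GET _cat/recovery/my-index?v&h=index,shard,type,stage,source_node,target_node

The type column shows existing_store for the quick recoveries and peer for the slow node-to-node copies I'm describing.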

I have witnessed this same behaviour after unfreezing an index and, most puzzlingly, after freezing an index. If I'm freezing an index which is Green, why would Elasticsearch decide that some of the shards have to be initialized by copying them from another node, making the index health Yellow for the duration, when all required shards obviously already exist?

In all these cases I can't find anything in the logs about why this is happening to the shards in question. Asking the Explain API about a shard which appears to be being recreated from a primary returns

"explanation" : "the shard is in the process of initializing on node [whatever], wait until initialization has completed"
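For completeness, I'm getting that via the cluster allocation explain API, along these lines (index name and shard number are placeholders):

GET _cluster/allocation/explain
{
  "index": "my-index",
  "shard": 0,
  "primary": false
}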

I know that sometimes Elasticsearch moves shards around in an attempt to achieve its idea of balance. In such cases the shards are shown as Relocating, not Initializing, and health remains Green. This makes me think that what I describe above is Elasticsearch recreating the shards, not just relocating them. Also, if it were the case that after opening, unfreezing or freezing an index Elasticsearch decided that some of the shards needed to be moved to achieve better balance, what I would expect is that all shards are first assigned using what's on disk, Green is quickly achieved, and then some shards are relocated. But maybe that's not how Elasticsearch works in that scenario.

Can anyone explain what's happening in the above scenarios or tell me where to look to maybe find out?

I think this is explained by #46318 which was addressed in 7.5.0. Prior to that, Elasticsearch was sometimes unable to determine that an existing replica on disk was any different from an empty replica, which resulted in suboptimal replica allocation decisions that could have the effects you describe.


So I'm right that shards are being recreated, not just moved then?

Am I right in thinking that doing a synced flush helps increase the chances of Elasticsearch recognising replicas as identical to primaries?

All the indices I've opened and seen the behaviour I described with were closed by Curator, which does a synced flush before closing an index. (Right? The close action has an option skip_flush which defaults to False.) But looking at the index which I froze (which, from the logs, I see involves closing then immediately reopening the index) and whose shards I then saw getting recreated, none of the shards have a sync_id value. The index in question was originally created on a different cluster to the one I froze it on. I transferred it between clusters by having cluster A take a snapshot to an NFS share accessible to both clusters, then having cluster B restore that snapshot. (I tried transferring it using the reindex API and my calculations told me it was going to take about two days; the snapshot method took under two hours.) So maybe (I am guessing) there's no sync_id value on any of the shards because the cluster considers it to have never been written to.
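(In case it's useful, this is how I'm checking for sync_id, via the index stats API at shard level; again the index name is a placeholder:)

GET /my-index/_stats?level=shards&filter_path=indices.*.shards.*.commit.user_data.sync_id

After a successful synced flush I'd expect every copy of a shard to report the same sync_id here; on the frozen index the field is simply absent.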

I'm not sure I understand the difference. It sounds like Elasticsearch is making a full copy of the shard, but that's how it moves shards.

Yes, it looks like Curator's close action performs a synced flush by default, but I do not think this persists across a snapshot/restore. I think it would help to perform another synced flush manually after the restore.
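In 6.8 a manual synced flush is just (substitute your index name):

POST /my-index/_flush/synced

The response reports, per index, how many shards were successfully sync-flushed and lists any failures, which would also tell you if the flush is failing rather than being skipped.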

Note that prior to 7.2.0 closed indices are not actively replicated, so it's possible that there are no good replicas until you open the index.

I stand corrected, the sync_id is preserved across a snapshot/restore. I am guessing that either the synced flush is failing (check your Curator logs) or else something is changing the index while Curator is closing it.

By "just moved" I mean, for example, a shard being relocated to better achieve balance, or because Elasticsearch determines a shard can no longer live on the node it's on (disk watermark reached, routing attributes changed, etc.). In those situations the shards are marked as Relocating and the index health remains Green. By "recreated" I mean a replica shard being created by copying a primary because no (good) copy of the replica has been found, during which the new replica is shown as Initializing and index health is Yellow.

Maybe. But it's not like replicas of closed indices can get lost without someone doing something to cause that. (Can they?) There were certainly replicas before closing the indices, no nodes have been decommissioned in the relevant time frame*, and there have been cases where I found all the shards on disk before opening an index, then opened it and saw Elasticsearch create a replica from a primary. I haven't done anything (I don't know how) to check whether the shards on disk are all considered good, though.

(*) The one time I did decommission nodes, I opened all the indices with shards on them first so Elasticsearch would relocate everything, and I checked that the Elasticsearch data directories were empty afterwards.