Sure. It means that Elasticsearch tried to replicate an operation on a shard (e.g. indexing a document) but one of the replicas was unavailable at the time. That replica had previously been available and in-sync, so it has to be removed from the in-sync set and marked as stale. It normally happens if a node drops off the cluster but the cluster hasn't yet reassigned all its shards elsewhere, and then you try to write to one of the missing shards. As this is logged at WARN level, it isn't expected to happen in a healthy cluster.
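To illustrate the bookkeeping involved, here's a minimal sketch (not Elasticsearch's actual code, and with illustrative names) of how a primary might maintain its in-sync set: a replica that can't acknowledge a replicated operation has missed a write, so it leaves the in-sync set and is marked stale.

```python
def replicate(op, in_sync_replicas, reachable_replicas):
    """Replicate `op` to every in-sync replica; demote unreachable ones.

    Returns the set of replicas that were marked stale by this operation.
    """
    stale = set()
    for replica in set(in_sync_replicas):
        if replica in reachable_replicas:
            pass  # replica applied the op and acknowledged it
        else:
            # The replica was in-sync but is unavailable right now, so it
            # misses this op and must leave the in-sync set (this is the
            # point at which the WARN message would be logged).
            in_sync_replicas.discard(replica)
            stale.add(replica)
    return stale

in_sync = {"replica-0", "replica-1"}
stale = replicate("index doc 1", in_sync, reachable_replicas={"replica-0"})
print(sorted(in_sync))  # ['replica-0']
print(sorted(stale))    # ['replica-1']
```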
I can't see any more useful logging to add for this beyond what we already log about nodes joining and leaving the cluster and about changes in cluster health.
Correct, it failed in the finalisation stage, not the translog stage as you had said:
It's waiting for the recovering shard's local checkpoint to catch up with the global checkpoint, which would indicate that it has processed all the operations in the translog and can now be marked as in-sync.
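As a rough sketch of that condition (illustrative only, not the real implementation): the recovering shard replays translog operations in sequence-number order, each processed operation advances its local checkpoint, and it only qualifies as in-sync once the local checkpoint has caught up with the global checkpoint.

```python
def replay_translog(local_checkpoint, global_checkpoint, translog_ops):
    """Replay ops in sequence-number order; return the new local checkpoint
    and whether the shard has caught up with the global checkpoint."""
    for seq_no in sorted(translog_ops):
        if seq_no == local_checkpoint + 1:
            # Operation processed: the local checkpoint advances to cover it.
            local_checkpoint = seq_no
    return local_checkpoint, local_checkpoint >= global_checkpoint

local, in_sync = replay_translog(local_checkpoint=4,
                                 global_checkpoint=7,
                                 translog_ops=[5, 6, 7])
print(local, in_sync)  # 7 True
```

If an operation is missing (say the shard only receives ops 5 and 6 when the global checkpoint is 7), the local checkpoint stalls below the global checkpoint and finalisation keeps waiting, which is the situation described here.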
Looking again at the hot threads output, it seems that the generic threadpool is completely full of threads stuck here, presumably blocking whatever would actually cause these checkpoints to advance; I've not investigated exactly what. As far as I can tell there's one of these actions per recovery, so this also looks like a consequence of the earlier excess of recoveries.
It does look like removing the replicas from all affected indices will unblock these actions, as will restarting the node, but I can't say what other problems might be lurking in this cluster. It has got into a very bad state, and I recommend a full cluster restart.
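For reference, dropping the replicas on an affected index is a dynamic settings update like the following (`my-index` is a placeholder; restoring replicas later is the same call with the original replica count):

```
PUT /my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}
```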