All copies of shards in unassigned state after intermittent connectivity issue with master

mosiddi · November 29, 2016, 8:45am

We recently ran into an issue where all copies (1 primary and 2 replicas) of one shard of an index ended up unassigned. This happened while the data nodes were having connectivity issues with the master (each data node lost connectivity with the master once, at different times).

Our hunch is that something like the following happened:

We have 3 data nodes: D11, D12 and D13. One of them had the primary and the other 2 had the replica shards.
From the logs, this is the sequence of events that could have led to this situation:

Time T1:
** Node D11 had the primary, and nodes D12 and D13 had the replica copies.
Time T2:
** Node D12 had a network issue and could not ping the master (M12) for almost a minute. Once it could talk to the master again, it restarted the initialization process for all of its shards.
Time T3:
** The replica initialization was almost stuck on node D12. This could be because it was initializing from node D11, which had itself left the cluster.
** Node D11 had a network issue and could not ping the master (M12) for almost a minute. Once it could talk to the master again, it restarted the initialization process for all of its shards.
** In the meantime, the cluster promoted D13’s copy to primary.
** When node D11 came back, its copy was marked as failed because it had been a failed primary, and the copy was treated as a replica.
** At this point, we had 2 failed replicas.
Time T4:
** Node D13 had a network issue and could not ping the master (M12) for almost a minute.
** The cluster service would have tried to promote one of the other copies to primary, but both were in a failed state, so none of the shard copies were available.
** When node D13 rejoined, the shard was already marked as failed, so the cluster did not explicitly retry the allocation.
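
For anyone checking whether they are in the same situation, here is a minimal Python sketch that scans the text output of `GET _cat/shards?h=index,shard,prirep,state,unassigned.reason` and lists the shards for which every copy is UNASSIGNED. The function name, the index name `logs` and the sample rows are made up for illustration and are not from our cluster. The parsing also assumes the columns come back in the order given in the `h=` list.

```python
from collections import defaultdict


def fully_unassigned(cat_shards_text):
    """Return sorted (index, shard) pairs for which every listed copy is UNASSIGNED.

    Assumes whitespace-separated rows in the order:
    index shard prirep state [unassigned.reason]
    """
    states = defaultdict(list)
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        index, shard, _prirep, state = fields[:4]
        states[(index, int(shard))].append(state)
    return sorted(
        key for key, copies in states.items()
        if all(state == "UNASSIGNED" for state in copies)
    )


# Hypothetical sample output (not real cluster data): shard 0 has lost all
# three copies, shard 1 still has a started primary and one started replica.
SAMPLE = """
logs 0 p UNASSIGNED ALLOCATION_FAILED
logs 0 r UNASSIGNED ALLOCATION_FAILED
logs 0 r UNASSIGNED NODE_LEFT
logs 1 p STARTED
logs 1 r STARTED
logs 1 r UNASSIGNED NODE_LEFT
"""

if __name__ == "__main__":
    print(fully_unassigned(SAMPLE))
```

A shard that shows up in this list has no copy the cluster is currently using, which is the state we ended up in after T4.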

Does anyone have any input on this? Is our understanding correct?
Thanks
Imran

