Unassigned shards while moving data nodes

(Imran Siddique) #1

We recently migrated our data from one region to another. The approach we chose for the move was:

  1. Update the cluster settings so that no data is allocated to the new nodes.
  2. Add the new data nodes.
  3. Update the cluster settings so that data is allowed only on the new nodes. This triggers shard movement from the old data nodes to the new ones.
  4. Delete the old data nodes once all data has moved.
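
One way to express steps 1 and 3 is shard allocation filtering by node name. The sketch below only builds the request bodies for `PUT _cluster/settings` and sends nothing to a cluster. The node names and the helper `exclude_nodes_body` are hypothetical, and the settings actually used in this migration are not stated above, so treat this as an assumption rather than the exact procedure.

```python
import json


def exclude_nodes_body(node_names):
    """Build a _cluster/settings body that keeps shards off the named nodes.

    Assumes allocation filtering by node name; filtering by IP or by a
    custom node attribute would use a different setting key.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


# Hypothetical node names, not taken from the cluster in this post.
new_nodes = ["new-data-1", "new-data-2"]
old_nodes = ["old-data-1", "old-data-2"]

# Step 1: exclude the new nodes before they join, so nothing lands on them yet.
step1 = exclude_nodes_body(new_nodes)

# Step 3: overwrite the exclusion with the old nodes. The new nodes become
# eligible again and the old nodes start draining their shards.
step3 = exclude_nodes_body(old_nodes)

for label, body in (("step 1", step1), ("step 3", step3)):
    print(f"# {label}: PUT _cluster/settings")
    print(json.dumps(body, indent=2))
```

Because step 3 overwrites the same setting rather than adding a second one, the new nodes are released and the old nodes are excluded in a single update.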

We noticed that a few shards went into the unassigned state and were never moved. They had the following reasons attached:

1. failed shard on node [qyYUIC2lQVSJ4spmO4Jebw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested:  **ShardLockObtainFailedException** [[a4003b53-5d8f-43a4-bc11-478dafa9ab65_0][47]: obtaining shard lock timed out after 5000ms]; 

2. failed shard on node [J42mCdMgTEuGwxiL2R-Z0Q]: failed updating shard routing entry, failure  **IndexShardRelocatedException** [CurrentState[RELOCATED] Shard is marked as relocated, cannot safely move to state STARTED]

3. failed shard on node [41DEPwUpQ1yTIdGt6pXWyg]: failed to perform indices:data/write/bulk[s] on replica [cb55739e-4afe-46a3-970f-1b49d8ee7564_2][2], node[41DEPwUpQ1yTIdGt6pXWyg], [R], s[STARTED], a[id=gRvnYbWZQZuovKyVvuip1A], failure  **NodeNotConnectedException** [[es-d56-rm][192.168.0.206:9300] Node not connected]

4. failed shard on node [41DEPwUpQ1yTIdGt6pXWyg]: failed to perform indices:data/write/bulk[s] on replica [7bc5fd9f-6098-479a-a87e-1533d288d438_475c4b23-509d-46b4-867c-0f3106061b15][28], node[41DEPwUpQ1yTIdGt6pXWyg], [R], s[STARTED], a[id=t9MLI5yuSIqKk7cEh3roxA], failure  **NodeDisconnectedException** [[es-d56-rm][192.168.0.206:9300][indices:data/write/bulk[s][r]] disconnected]

We resolved the issue by taking one data node out of the cluster at a time, so that the master nodes had a trigger to reallocate the shards. Has anyone else in this forum seen these errors?
Thanks
Imran