We recently migrated our Elasticsearch data from one region to another. The approach we chose for the move was:
- Update the cluster settings so that no data is allocated to the new nodes
- Add the new data nodes
- Update the cluster settings to allow data only on the new nodes; this triggers shard relocation from the old nodes to the new ones
- Remove the old data nodes once all the data has moved
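The allocation toggling in the steps above can be sketched with Elasticsearch's shard allocation filtering. This is a minimal sketch that just builds and prints the settings payloads; the node-name patterns (`es-old-*`, `es-new-*`) are hypothetical, and in practice each payload would be sent via `PUT /_cluster/settings`.

```python
import json

# Hypothetical node-name patterns; substitute your cluster's real names.
OLD_NODES = "es-old-*"
NEW_NODES = "es-new-*"

# Step 1: before adding the new nodes, exclude them from allocation
# so no shards land on them prematurely.
step1 = {"persistent": {
    "cluster.routing.allocation.exclude._name": NEW_NODES}}

# Step 3: after the new nodes have joined, flip the filter so shards may
# only live on the new nodes; the master then relocates everything off
# the old ones.
step3 = {"persistent": {
    "cluster.routing.allocation.exclude._name": OLD_NODES}}

# Each payload would be sent as the body of:
#   PUT /_cluster/settings   (Content-Type: application/json)
for step in (step1, step3):
    print(json.dumps(step, indent=2))
```

Step 2 (adding the nodes) and step 4 (shutting down the drained nodes) happen outside the settings API, so they are not shown here.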
What we noticed was that a few shards went into the unassigned state and were never moved. They had these failure reasons attached:
1. failed shard on node [qyYUIC2lQVSJ4spmO4Jebw]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: **ShardLockObtainFailedException** [[a4003b53-5d8f-43a4-bc11-478dafa9ab65_0]: obtaining shard lock timed out after 5000ms]
2. failed shard on node [J42mCdMgTEuGwxiL2R-Z0Q]: failed updating shard routing entry, failure **IndexShardRelocatedException** [CurrentState[RELOCATED] Shard is marked as relocated, cannot safely move to state STARTED]
3. failed shard on node [41DEPwUpQ1yTIdGt6pXWyg]: failed to perform indices:data/write/bulk[s] on replica [cb55739e-4afe-46a3-970f-1b49d8ee7564_2], node[41DEPwUpQ1yTIdGt6pXWyg], [R], s[STARTED], a[id=gRvnYbWZQZuovKyVvuip1A], failure **NodeNotConnectedException** [[es-d56-rm][192.168.0.206:9300] Node not connected]
4. failed shard on node [41DEPwUpQ1yTIdGt6pXWyg]: failed to perform indices:data/write/bulk[s] on replica [7bc5fd9f-6098-479a-a87e-1533d288d438_475c4b23-509d-46b4-867c-0f3106061b15], node[41DEPwUpQ1yTIdGt6pXWyg], [R], s[STARTED], a[id=t9MLI5yuSIqKk7cEh3roxA], failure **NodeDisconnectedException** [[es-d56-rm][192.168.0.206:9300][indices:data/write/bulk[s][r]] disconnected]
We resolved the issue by taking one data node out of the cluster at a time, which gave the master nodes a trigger to re-evaluate the stuck shard allocations. Has anyone else in this forum seen these errors?
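The one-node-at-a-time workaround can be sketched the same way, by growing the exclusion filter incrementally. The node names below are hypothetical except for `es-d56-rm`, which appears in the error messages; each cluster-state change from a new exclusion is what nudges the master to retry the stuck allocations.

```python
import json

# Old data nodes to drain; only es-d56-rm is taken from the logs above,
# the other names are made up for illustration.
old_nodes = ["es-d55-rm", "es-d56-rm", "es-d57-rm"]

drained = []
for node in old_nodes:
    drained.append(node)
    # Each PUT of an expanded exclude list is a cluster-state change,
    # prompting the master to re-evaluate (and retry) shard allocation.
    payload = {"persistent": {
        "cluster.routing.allocation.exclude._name": ",".join(drained)}}
    print(json.dumps(payload))
    # In practice: send the payload via PUT /_cluster/settings, then
    # wait for the cluster to return to green before draining the next node.
```

This is only a sketch of the sequencing; the waits between steps (cluster health checks) are where the actual pacing happens.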