Red Cluster State: failed to create shard, failure IOException[failed to obtain in-memory shard lock]

Hi Team,

We have an 8-node cluster, and last week we noticed 2 shards were unassigned.
We ran GET /_cluster/allocation/explain?pretty and got the explanation below:

{
  "index": "v20-con*****",
  "shard": 2,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2020-08-14T07:18:40.994Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con*****][2]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "2erwYLE2T9GXI4di8Y8-LQ",
      "node_name": "VW135-Master",
      "transport_address": "10.69.75.:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "GXvJAiNGSTOD7PJqq9F-6w",
      "node_name": "VW1394-Data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "K_WOSocvTR2V-tJ8Q2n5CQ",
      "node_name": "VW1398-Master",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "TjKfKo2vQVSW-YEHaxrcNw",
      "node_name": "VW1393-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "xFCQNjx5QFyQU2tq85gPIA",
        "store_exception": {
          "type": "shard_lock_obtain_failed_exception",
          "reason": "[v20-con****][2]: obtaining shard lock timed out after 5000ms",
          "index_uuid": "rpQhip29Rrejy2H9mcFSSg",
          "shard": "2",
          "index": "v20-con*****"
        }
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-08-14T07:18:40.994Z], failed_attempts[5], delayed=false, details[failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con*****][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id": "VFMbQMokSqyx6msPyiYnLA",
      "node_name": "VW1374-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "jyr7gwB_QiKf8utR4PtvoA",
      "node_name": "VW1361-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "rd3JxClYTDWQV4SMpS-7xQ",
      "node_name": "VW1359-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "zeA3D077RxutzNBxqVUaxA"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-08-14T07:18:40.994Z], failed_attempts[5], delayed=false, details[failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con**][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id": "uc8zjL5SQm6tIkmhbKq2ng",
      "node_name": "VW1340-Master",
      "transport_address": "10.69..***:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
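
As we read it, the max_retry decider above already names the standard recovery path: the shard lock is held in memory on the node, so once that node settles down (or is restarted, which releases the lock), the failed allocation can be retried manually with the exact endpoint quoted in the decider explanation:

POST /_cluster/reroute?retry_failed=true

GET /_cluster/health?pretty

The second call is just to confirm the shard was assigned and the cluster left red state.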

As this is a production environment, we restored the latest available index backup to fix the issue.
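
For completeness, the restore was the standard snapshot-restore flow. A minimal sketch, with hypothetical repository and snapshot names (my_backup_repo, snapshot_latest) standing in for our real ones:

POST /v20-con*****/_close

POST /_snapshot/my_backup_repo/snapshot_latest/_restore
{
  "indices": "v20-con*****"
}

(The index has to be closed first, since the restore replaces an existing open index.)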

Please help us understand in which cases such an issue occurs and how we can prevent it in the future.
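
One thing we assume might have helped us catch this earlier (we did not have it in place): polling the cat shards API for unassigned shards before the 5 allocation retries are exhausted, e.g.

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state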

Thanks!
