Hi Team,
We have an 8-node cluster, and last week we noticed that 2 shards were unassigned.
We ran GET /_cluster/allocation/explain?pretty and got the explanation below:
{
  "index": "v20-con*****",
  "shard": 2,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "ALLOCATION_FAILED",
    "at": "2020-08-14T07:18:40.994Z",
    "failed_allocation_attempts": 5,
    "details": "failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con*****][2]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions": [
    {
      "node_id": "2erwYLE2T9GXI4di8Y8-LQ",
      "node_name": "VW135-Master",
      "transport_address": "10.69.75.:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "GXvJAiNGSTOD7PJqq9F-6w",
      "node_name": "VW1394-Data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "K_WOSocvTR2V-tJ8Q2n5CQ",
      "node_name": "VW1398-Master",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "TjKfKo2vQVSW-YEHaxrcNw",
      "node_name": "VW1393-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "xFCQNjx5QFyQU2tq85gPIA",
        "store_exception": {
          "type": "shard_lock_obtain_failed_exception",
          "reason": "[v20-con****][2]: obtaining shard lock timed out after 5000ms",
          "index_uuid": "rpQhip29Rrejy2H9mcFSSg",
          "shard": "2",
          "index": "v20-con*****"
        }
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-08-14T07:18:40.994Z], failed_attempts[5], delayed=false, details[failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con*****][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id": "VFMbQMokSqyx6msPyiYnLA",
      "node_name": "VW1374-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "jyr7gwB_QiKf8utR4PtvoA",
      "node_name": "VW1361-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "rd3JxClYTDWQV4SMpS-7xQ",
      "node_name": "VW1359-data",
      "transport_address": "10.69..:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "in_sync": true,
        "allocation_id": "zeA3D077RxutzNBxqVUaxA"
      },
      "deciders": [
        {
          "decider": "max_retry",
          "decision": "NO",
          "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-08-14T07:18:40.994Z], failed_attempts[5], delayed=false, details[failed shard on node [rd3JxClYTDWQV4SMpS-7xQ]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[v20-con**][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id": "uc8zjL5SQm6tIkmhbKq2ng",
      "node_name": "VW1340-Master",
      "transport_address": "10.69..***:9300",
      "node_attributes": {
        "ml.machine_memory": "-1",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
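For context, the max_retry decider output above names the call it wants us to make. A minimal sketch of what we could have tried before restoring from backup (the host/port here are placeholder assumptions, not our actual addresses):

```shell
# Re-attempt allocation of shards that exhausted their 5 retries.
# The endpoint is taken verbatim from the decider explanation above;
# localhost:9200 is an assumed address for illustration.
curl -s -X POST "http://localhost:9200/_cluster/reroute?retry_failed=true&pretty"

# Then re-check whether the shard is still unassigned:
curl -s "http://localhost:9200/_cluster/allocation/explain?pretty"
```

We did not run this at the time; sharing it in case it is the expected recovery path for this failure.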
As this is a production environment, we restored the latest available index backup to fix the issue.
Please help us understand in which cases such an issue occurs and how we can prevent it in the future.
Thanks!