Greetings, people who know way more about Elasticsearch than I do! I'm very new to Elastic administration, as our organization has only recently started using it, so please pardon my lack of knowledge.
We have a 4-node cluster running Elasticsearch 6.8.13, and we have a corrupted index.
Curling "/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED" returns the following output:
portal_CorruptedIndexName_1625203723 3 p UNASSIGNED MANUAL_ALLOCATION
portal_CorruptedIndexName_1625203723 3 r UNASSIGNED MANUAL_ALLOCATION
portal_CorruptedIndexName_1625203723 0 p UNASSIGNED MANUAL_ALLOCATION
portal_CorruptedIndexName_1625203723 0 r UNASSIGNED MANUAL_ALLOCATION
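(In case the quoting above is ambiguous, the full command looks roughly like this, with the node address replaced by a placeholder:)

curl -s "http://<node>:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED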
Following is the output from "/_cluster/allocation/explain":
{
  "index" : "portal_CorruptedIndexName_1625203723",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "MANUAL_ALLOCATION",
    "at" : "2021-07-16T06:34:18.221Z",
    "details" : "failed shard on node [iyfZi8qOSUuejn7IHNZaBw]: shard failure, reason [corrupt file (source: [start])], failure CorruptIndexException[Problem reading index. (resource=/elasticsearch/data/nodes/0/indices/2t3xC5NqS468Ft5hFP61Qg/3/index/_fy_Lucene50_0.tim)]; nested: NoSuchFileException[/elasticsearch/data/nodes/0/indices/2t3xC5NqS468Ft5hFP61Qg/3/index/_fy_Lucene50_0.tim]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "W5YlxEiMQv2iely9YJn9sw",
      "node_name" : "XXXXXXELSC01N02",
      "node_attributes" : {
        "ml.machine_memory" : "8041381888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "iyfZi8qOSUuejn7IHNZaBw",
      "node_name" : "XXXXXXELSC01N04",
      "node_attributes" : {
        "ml.machine_memory" : "8041381888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "vzIbJUz1TgSOmRZsITp0SQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [corrupt file (source: [start])]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [corrupt file (source: [start])])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "Problem reading index. (resource=/elasticsearch/data/nodes/0/indices/2t3xC5NqS468Ft5hFP61Qg/3/index/_fy_Lucene50_0.tim)",
              "caused_by" : {
                "type" : "no_such_file_exception",
                "reason" : "/elasticsearch/data/nodes/0/indices/2t3xC5NqS468Ft5hFP61Qg/3/index/_fy_Lucene50_0.tim"
              }
            }
          }
        }
      }
    },
    {
      "node_id" : "mmjyzudUR5Sapj23Y0AdRw",
      "node_name" : "XXXXXXELSC01N01",
      "node_attributes" : {
        "ml.machine_memory" : "8041381888",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "wBDiKORZRt-SP-F3ZcvlsA",
      "node_name" : "XXXXXXELSC01N03",
      "node_attributes" : {
        "ml.machine_memory" : "8041381888",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "7VC_Ru5VRbiZxZQy8_8N_w"
      }
    }
  ]
}
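In case it helps, the explain API can also be pointed at the specific failing shard with a request body like the following (the host is a placeholder; the index/shard values match the failing primary above):

curl -s -H "Content-Type: application/json" -X GET "http://<node>:9200/_cluster/allocation/explain?pretty" -d '{
  "index" : "portal_CorruptedIndexName_1625203723",
  "shard" : 3,
  "primary" : true
}'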
Unfortunately, we didn't discover that our snapshots had stopped running until after the index became corrupted, so restoring from a snapshot is not an option. Since the corruption appears to stem from a missing file, I tried shutting down the node where the file was missing and letting the cluster fully recover, hoping the shards would be reallocated, but that didn't resolve the issue.
I'm wondering whether it's possible to recreate that file, or otherwise recover the index, ideally without losing any data.
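For example, would something like the reroute API with allocate_stale_primary be a reasonable next step here, or is that guaranteed to lose data (or simply fail) when the only in-sync copy is the corrupted one on XXXXXXELSC01N04? A rough sketch of what I mean, with the host as a placeholder:

curl -s -H "Content-Type: application/json" -X POST "http://<node>:9200/_cluster/reroute?pretty" -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "portal_CorruptedIndexName_1625203723",
        "shard" : 3,
        "node" : "XXXXXXELSC01N04",
        "accept_data_loss" : true
      }
    }
  ]
}'

I've also seen the elasticsearch-shard tool (remove-corrupted-data) mentioned for situations like this, but I haven't tried either approach yet and would appreciate guidance on which, if any, is safe.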
Many thanks in advance for any assistance!