Allocation/re-routing of unassigned shards

Hi,
could you please help me solve an ongoing issue with unassigned shards?

The story behind this sad example: a cluster with a single node went down (the instance was terminated), the disk holding the Elasticsearch data was attached to another system using the same path.data, and now I'm trying to ignore the previous node and relocate the outstanding shards onto the new cluster.

new cluster info:

ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.51.184.252 47 76 19 5.03 4.85 3.09 mdi * cae1elk-000-001
10.51.184.101 51 61 15 0.62 0.96 1.17 mdi - cae1elk-000-002

overall index information:

[root@cae1elk-000-001 ~]# curl -s -XGET http://`hostname`:9200/_cat/indices?v | grep ^green | wc -l
15
[root@cae1elk-000-001 ~]# curl -s -XGET http://`hostname`:9200/_cat/indices?v | grep ^red | wc -l
190

cluster status:

{
  "cluster_name" : "cae_ops_cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 218,
  "active_shards" : 248,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 568,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 30.317848410757946
}

When I apply the workaround from https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-reroute.html
I get the following:

curl -s -XGET http://`hostname`:9200/_cluster/allocation/explain?pretty
{
  "index" : "snmptrap-events-2017.05.17",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2017-11-13T14:36:23.256Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions" : [
    {
      "node_id" : "DJtotx_7SMaGCUecvyknEQ",
      "node_name" : "cae1elk-000-001",
      "transport_address" : "10.51.184.252:9300",
      "node_attributes" : {
        "aws_availability_zone" : "us-east-1a"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "jcNMPApaSCeptag3PxW_8w",
      "node_name" : "cae1elk-000-002",
      "transport_address" : "10.51.184.101:9300",
      "node_attributes" : {
        "aws_availability_zone" : "us-east-1a"
      },
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}

[root@cae1elk-000-001 ~]# curl -XPOST http://`hostname`:9200/_cluster/reroute -d '@4'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"[move_allocation] can't move 0, failed to find it on node {cae1elk-000-001}{DJtotx_7SMaGCUecvyknEQ}{50-UmnMpQWO2CB2t2R9wHQ}{10.51.184.252}{10.51.184.252:9300}{aws_availability_zone=us-east-1a}"}],"type":"illegal_argument_exception","reason":"[move_allocation] can't move 0, failed to find it on node {cae1elk-000-001}{DJtotx_7SMaGCUecvyknEQ}{50-UmnMpQWO2CB2t2R9wHQ}{10.51.184.252}{10.51.184.252:9300}{aws_availability_zone=us-east-1a}"},"status":400}[root@cae1elk-000-001 ~]# cat 4
{
  "commands" : [
    {
      "move" : {
        "index" : "snmptrap-events-2017.05.17", "shard" : 0,
        "from_node" : "cae1elk-000-001", "to_node" : "cae1elk-000-002"
      }
    },
    {
      "allocate_replica" : {
        "index" : "snmptrap-events-2017.05.17", "shard" : 1,
        "node" : "cae1elk-000-002"
      }
    }
  ]
}

Is there any possibility in my case to somehow restore the shards, even with data loss?
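
From the same cluster-reroute page, the only option I can still see for a primary with no valid copy left is a forced allocation: allocate_empty_primary (or allocate_stale_primary, if an old copy ever turns up) with accept_data_loss set to true. A rough sketch of what I'm considering, reusing the example index from above and picking the target node more or less arbitrarily; as I understand it, this creates an empty shard and discards whatever data used to be in it:

curl -XPOST http://`hostname`:9200/_cluster/reroute -d '{
  "commands" : [
    {
      "allocate_empty_primary" : {
        "index" : "snmptrap-events-2017.05.17",
        "shard" : 0,
        "node" : "cae1elk-000-002",
        "accept_data_loss" : true
      }
    }
  ]
}'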

What was the replication factor? We faced this issue on an Elasticsearch 2.4 cluster where some nodes had a full root volume and, as a result, ES started acting out; newly written shards got corrupted, both primaries and a few replicas. The first thing we did was disable allocation to prevent any further movement of data. We then tried to identify the nodes with good replicas, removed a node from the cluster (allocation still disabled), reattached its EBS volume to a fresh node, and finally re-enabled allocation.
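
Roughly, the allocation toggle we used looked like this (from memory, so treat it as a sketch; 2.4-era curl against localhost, but the cluster.routing.allocation.enable setting works the same way on 5.x):

# stop the cluster from shuffling shards around while volumes are detached/reattached
curl -XPUT http://localhost:9200/_cluster/settings -d '{
  "transient" : { "cluster.routing.allocation.enable" : "none" }
}'

# ... remove the node, reattach its EBS volume to a fresh instance, start ES ...

# let allocation resume once the data is back in place
curl -XPUT http://localhost:9200/_cluster/settings -d '{
  "transient" : { "cluster.routing.allocation.enable" : "all" }
}'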
Several of the indices became green, but one was still red, with 2 bad shards.

After a lot of trying we gave up, restored a fresh ES cluster from the previous night's backup, and copied the data for the bad index across clusters.
I'll be watching this thread in case there is a better solution with ES 5.x. Good luck!

Fixed it somehow, but I'm still trying to figure out which part of the workaround actually did the trick in my case.
Thanks anyway for the suggested approach!
