Unable to recover my cluster

We moved our data to a new version of Elasticsearch. The new cluster has 3 master, 3 hot and 3 warm nodes. Everything was working fine until this morning, when all of a sudden the cluster health went red. Looking into it further, I realized that shard allocation had failed because of low storage on the warm nodes. I increased the volumes and restarted the warm nodes, then tried to recover the cluster with the _cluster/reroute API. Nothing happened, and now
with _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state
I am getting almost 83 shards unassigned with reason NODE_LEFT.
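
For anyone wanting to tally those unassigned shards by reason rather than counting rows by eye, here is a minimal Python sketch. It assumes the same _cat/shards request was made with format=json added, so each row arrives as a JSON object keyed by column name. The function name and the sample rows are illustrative only (just the first row corresponds to a shard mentioned in this thread).

```python
import json
from collections import Counter


def unassigned_by_reason(rows):
    """Count UNASSIGNED shards per unassigned.reason value."""
    return Counter(
        row.get("unassigned.reason") or "UNKNOWN"
        for row in rows
        if row.get("state") == "UNASSIGNED"
    )


# Illustrative response body for:
#   _cat/shards?format=json&h=index,shard,prirep,state,node,unassigned.reason
# Only the first row is taken from the thread; the rest are made up.
sample_response = """
[
  {"index": "hashgraphaccounttransfer-000017", "shard": "3", "prirep": "p",
   "state": "UNASSIGNED", "node": null, "unassigned.reason": "NODE_LEFT"},
  {"index": "example-index-000001", "shard": "0", "prirep": "r",
   "state": "UNASSIGNED", "node": null, "unassigned.reason": "NODE_LEFT"},
  {"index": "example-index-000001", "shard": "0", "prirep": "p",
   "state": "STARTED", "node": "hot-node-1", "unassigned.reason": null}
]
"""

counts = unassigned_by_reason(json.loads(sample_response))
for reason, count in counts.most_common():
    print(f"{reason}: {count}")
```

Rows that are assigned are skipped, so the totals only cover shards the cluster still has to place.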

Please help.

Thanks in Advance

Welcome to our community!

What version are you on? What do your master logs show? What does the allocation explain API (_cluster/allocation/explain) show for one of the shards? What does _cat/recovery?v show?

Hi Mark,
Thanks for your response.
I don't see anything specific to the shard issue in the master log.
Here is the output of _cluster/allocation/explain:

{
note: "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
index: "hashgraphaccounttransfer-000017",
shard: 3,
primary: true,
current_state: "unassigned",
unassigned_info: {
reason: "NODE_LEFT",
at: "2023-05-09T12:56:28.001Z",
details: "node_left [_u38-owTRr2UwR0kTUH2rw]",
last_allocation_status: "throttled"
},
can_allocate: "throttled",
allocate_explanation: "allocation temporarily throttled",
node_allocation_decisions: [
{
node_id: "_u38-owTRr2UwR0kTUH2rw",
node_name: "warm-node-2",
transport_address: "172.32.1.125:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "throttled",
store: {
in_sync: true,
allocation_id: "r_QbqvGXSkOaLcKouFXdKg"
},
deciders: [
{
decider: "throttling",
decision: "THROTTLE",
explanation: "reached the limit of ongoing initial primary recoveries [6], cluster setting [cluster.routing.allocation.node_initial_primaries_recoveries=6]"
}
]
},
{
node_id: "4nDKSda9Q3SsGH4NqYHgSA",
node_name: "hot-node-3",
transport_address: "172.32.2.17:9300",
node_attributes: {
data: "hot",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "7JUClIzEQaaqxEd6HpBfUg",
node_name: "warm-node-3a",
transport_address: "172.32.2.68:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "FsiWEnj-QBO0Tp_5DsFS5Q",
node_name: "warm-node-1",
transport_address: "172.32.0.208:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "IatGaiCUQ3SEKg9BSXjYtQ",
node_name: "warm-node-2a",
transport_address: "172.32.1.79:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "SjsTQhboQEC0USoMCP5khQ",
node_name: "warm-node-3",
transport_address: "172.32.2.227:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "TTeJyCyrSIC281GTQUEh7g",
node_name: "hot-node-1",
transport_address: "172.32.0.246:9300",
node_attributes: {
data: "hot",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "aVPNgVNwTCiNGT7eQdNevA",
node_name: "hot-node-2",
transport_address: "172.32.1.248:9300",
node_attributes: {
data: "hot",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
},
{
node_id: "e-LTEHGyR3OUcqEsqC2Yew",
node_name: "warm-node-1a",
transport_address: "172.32.0.176:9300",
node_attributes: {
data: "warm",
xpack.installed: "true",
transform.node: "false"
},
node_decision: "no",
store: {
found: false
}
}
]
}
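
The explain output above is long, but the part that matters is which nodes hold an in-sync copy of the shard and why they cannot take it yet. A minimal Python sketch of that filtering follows, assuming the response has been parsed into a dict. The function name is hypothetical and the sample is trimmed to two of the nodes shown above.

```python
def nodes_with_in_sync_copy(explain):
    """Return (node_name, node_decision, decider explanations) for every
    node whose store section reports an in-sync copy of the shard."""
    holders = []
    for node in explain.get("node_allocation_decisions", []):
        if node.get("store", {}).get("in_sync") is True:
            reasons = [d.get("explanation", "") for d in node.get("deciders", [])]
            holders.append((node["node_name"], node["node_decision"], reasons))
    return holders


# Trimmed copy of the response above (two of the nine nodes).
explain = {
    "index": "hashgraphaccounttransfer-000017",
    "shard": 3,
    "primary": True,
    "node_allocation_decisions": [
        {
            "node_name": "warm-node-2",
            "node_decision": "throttled",
            "store": {"in_sync": True, "allocation_id": "r_QbqvGXSkOaLcKouFXdKg"},
            "deciders": [
                {
                    "decider": "throttling",
                    "decision": "THROTTLE",
                    "explanation": "reached the limit of ongoing initial primary recoveries [6]",
                }
            ],
        },
        {
            "node_name": "hot-node-3",
            "node_decision": "no",
            "store": {"found": False},
        },
    ],
}

holders = nodes_with_in_sync_copy(explain)
for name, decision, reasons in holders:
    print(name, decision, reasons)
```

For this shard the only in-sync copy is on warm-node-2, and the decision there is a throttle rather than a refusal, which suggests the shard is waiting for a recovery slot on that node rather than being impossible to assign.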

/_cat/recovery?v

index                             shard time  type        stage source_host  source_node target_host  target_node repository snapshot files files_recovered files_percent files_total bytes       bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
hashgraphtxnsummary-000007        0     81ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000007        0     90ms  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         16           16                     100.0%
hashgraphtxnsummary-000007        1     70ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000007        1     93ms  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         12           12                     100.0%
hashgraphtxnsummary-000007        2     89ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000007        2     93ms  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         13           13                     100.0%
hashgraphtxnsummary-000007        3     89ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000007        3     98ms  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         15           15                     100.0%
hashgraphtxnsummary-000007        4     88ms  peer        done  172.32.2.17  hot-node-3  172.32.0.246 hot-node-1  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         16           16                     100.0%
hashgraphtxnsummary-000007        4     79ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     0     44ms  empty_store done  n/a          n/a         172.32.0.246 hot-node-1  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     0     49ms  peer        done  172.32.0.246 hot-node-1  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         0            0                      100.0%
hashgraphhcstxnsummary-000002     1     77ms  peer        done  172.32.2.17  hot-node-3  172.32.0.246 hot-node-1  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         0            0                      100.0%
hashgraphhcstxnsummary-000002     1     75ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     2     72ms  peer        done  172.32.1.248 hot-node-2  172.32.2.17  hot-node-3  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         0            0                      100.0%
hashgraphhcstxnsummary-000002     2     34ms  empty_store done  n/a          n/a         172.32.1.248 hot-node-2  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     3     53ms  empty_store done  n/a          n/a         172.32.0.246 hot-node-1  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     3     62ms  peer        done  172.32.0.246 hot-node-1  172.32.2.17  hot-node-3  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         0            0                      100.0%
hashgraphhcstxnsummary-000002     4     61ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphhcstxnsummary-000002     4     51ms  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         0            0                      100.0%
hashgraphtxnsummary-000006        0     50ms  empty_store done  n/a          n/a         172.32.0.246 hot-node-1  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000006        0     434ms peer        done  172.32.0.246 hot-node-1  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         2476         2476                   100.0%
hashgraphtxnsummary-000006        1     62ms  empty_store done  n/a          n/a         172.32.2.17  hot-node-3  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000006        1     1.5s  peer        done  172.32.2.17  hot-node-3  172.32.1.248 hot-node-2  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         6017         6017                   100.0%
hashgraphtxnsummary-000006        2     29.3s peer        done  172.32.1.248 hot-node-2  172.32.0.246 hot-node-1  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         121953       121953                 100.0%
hashgraphtxnsummary-000006        2     39ms  empty_store done  n/a          n/a         172.32.1.248 hot-node-2  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000006        3     47ms  empty_store done  n/a          n/a         172.32.0.246 hot-node-1  n/a        n/a      0     0               0.0%          0           0           0               0.0%          0           0            0                      100.0%
hashgraphtxnsummary-000006        3     474ms peer        done  172.32.0.246 hot-node-1  172.32.2.17  hot-node-3  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         2404         2404                   100.0%
hashgraphtxnsummary-000006        4     1.6s  peer        done  172.32.2.17  hot-node-3  172.32.0.246 hot-node-1  n/a        n/a      1     1               100.0%        1           226         226             100.0%        226         6075         6075                   100.0%
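
To see at a glance which nodes the listed recoveries target and what stage they are in, the rows can be grouped as in the sketch below. It assumes the recovery listing was fetched with format=json so that each row is a dict keyed by column name. The function name is hypothetical, and the sample holds three rows copied from the table above.

```python
from collections import Counter


def recoveries_by_node_and_stage(rows):
    """Count recovery rows per (target_node, stage) pair."""
    return Counter((row["target_node"], row["stage"]) for row in rows)


# Three rows taken from the _cat/recovery output above (other columns omitted).
rows = [
    {"index": "hashgraphtxnsummary-000007", "shard": "0", "type": "empty_store",
     "stage": "done", "target_node": "hot-node-3"},
    {"index": "hashgraphtxnsummary-000007", "shard": "0", "type": "peer",
     "stage": "done", "target_node": "hot-node-2"},
    {"index": "hashgraphtxnsummary-000006", "shard": "2", "type": "peer",
     "stage": "done", "target_node": "hot-node-1"},
]

by_node = recoveries_by_node_and_stage(rows)
for (node, stage), count in sorted(by_node.items()):
    print(f"{node} {stage}: {count}")

# Recoveries that have not finished yet (none in this sample).
in_progress = [row for row in rows if row["stage"] != "done"]
```

In the portion of the output pasted here, every row is in stage done and targets a hot node, so it does not show the recoveries on the warm nodes that the throttling message refers to.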

We are on version 7.17.9

Please post the output of:

GET _cluster/settings
GET _cluster/health

Mainly check the routing allocation settings and how many shards are still unassigned or relocating.

GET _cluster/health

{
cluster_name: "mainnet-es-cluster",
status: "red",
timed_out: false,
number_of_nodes: 12,
number_of_data_nodes: 9,
active_primary_shards: 113,
active_shards: 206,
relocating_shards: 5,
initializing_shards: 19,
unassigned_shards: 63,
delayed_unassigned_shards: 0,
number_of_pending_tasks: 0,
number_of_in_flight_fetch: 0,
task_max_waiting_in_queue_millis: 0,
active_shards_percent_as_number: 71.52777777777779
}

_cluster/settings

{
persistent: {
cluster: {
routing: {
allocation: {
disk: {
watermark: {
low: "90%",
high: "93%"
}
}
}
}
}
},
transient: {
cluster: {
routing: {
allocation: {
node_initial_primaries_recoveries: "6"
}
}
}
}
}

The total of the two values below is always 82:

initializing_shards: 19,
unassigned_shards: 63,
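
As a quick cross-check of those numbers against the health output, here is the arithmetic in Python. The values are copied from the _cluster/health response above. Treating relocating shards as already included in active_shards is an assumption, though the result lines up with the reported percentage.

```python
# Values copied from the _cluster/health response in this thread.
health = {
    "active_shards": 206,
    "relocating_shards": 5,
    "initializing_shards": 19,
    "unassigned_shards": 63,
}

# Shards that are not serving yet: initializing plus unassigned.
not_yet_active = health["initializing_shards"] + health["unassigned_shards"]

# Assumption: relocating shards are already counted inside active_shards,
# so they are not added again here.
total_shards = health["active_shards"] + not_yet_active

active_percent = 100.0 * health["active_shards"] / total_shards

print(f"not yet active: {not_yet_active}")
print(f"total shards:   {total_shards}")
print(f"active percent: {active_percent:.2f}")
```

This gives 82 shards not yet active out of 288, about 71.53% active, in line with the active_shards_percent_as_number field. If the 82 stays constant while the split between initializing and unassigned also stays at 19 and 63, the recoveries are probably stalled rather than slowly progressing.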

What about
GET _cat/shards

This should show you how many shards are reinitializing. If that is the case, you have to wait.

This is the number of shards reinitializing, but that doesn't work either:
initializing_shards: 19

The node_initial_primaries_recoveries setting might be causing the problem, as the limit is 6.
Maybe set it to 20 and see if it makes any difference:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_initial_primaries_recoveries": 20
  }
}
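
One thing that may be worth ruling out here: in the settings output earlier in the thread this value sits under transient, while the request above writes it under persistent. Elasticsearch applies transient settings ahead of persistent ones, so a persistent 20 would not take effect while a transient 6 is still set. The Python sketch below only illustrates that precedence on a parsed GET _cluster/settings response. The helper names are hypothetical, and the sample shows how the settings would look if the request above were applied to the output posted earlier.

```python
KEY = "cluster.routing.allocation.node_initial_primaries_recoveries"
DEFAULT_VALUE = "4"  # documented default for this setting, to my knowledge


def lookup(settings, dotted_key):
    """Find dotted_key in either flat or nested settings; None if absent."""
    if dotted_key in settings:
        return settings[dotted_key]
    node = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def effective_value(cluster_settings, key, default):
    """Return (value, scope), checking transient before persistent."""
    for scope in ("transient", "persistent"):
        value = lookup(cluster_settings.get(scope, {}), key)
        if value is not None:
            return value, scope
    return default, "default"


# Hypothetical state after the PUT above, given the settings posted earlier.
cluster_settings = {
    "persistent": {
        "cluster": {"routing": {"allocation": {
            "disk": {"watermark": {"low": "90%", "high": "93%"}},
            "node_initial_primaries_recoveries": "20",
        }}}
    },
    "transient": {
        "cluster": {"routing": {"allocation": {
            "node_initial_primaries_recoveries": "6",
        }}}
    },
}

value, scope = effective_value(cluster_settings, KEY, DEFAULT_VALUE)
print(f"{KEY} = {value} (from {scope})")
```

In this sketch the effective value is still 6, taken from the transient scope. If that matches the real cluster, setting the transient value to null, or sending the new value as transient, would be the way to let the higher limit apply.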

I actually tried changing that, but it didn't help.
