I am trying to use the /_clone API to clone an index. Before cloning the current index, I set
"read_only": "true"
as instructed by the documentation. I then perform the clone, and as soon as the clone API call succeeds, I set both the new and the old index back to "read_only": "false".
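Roughly, the daily sequence looks like this (index names here are placeholders; the real ones are timestamped, e.g. transactions_1709083815105, and the read-only flag is the index.blocks.read_only setting, which is what the FORBIDDEN/5 block corresponds to):

PUT /transactions_old/_settings
{
  "index.blocks.read_only": true
}

POST /transactions_old/_clone/transactions_new

PUT /transactions_old/_settings
{
  "index.blocks.read_only": false
}

PUT /transactions_new/_settings
{
  "index.blocks.read_only": false
}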
I do this once a day in the evening, and I have been seeing my cluster drop to RED status. After checking the logs, the issue appears to be that one of my 5 primary shards is unassigned, with an ALLOCATION_FAILED error.
The full allocation explanation for the shard, from GET /_cluster/allocation/explain, is:
{
"index" : "transactions_1709083815105",
"shard" : 1,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2024-02-28T01:31:04.433Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [2adetjU2SM-DBPTDdGFh7w]: failed recovery, failure RecoveryFailedException[[transactions_1709083815105][1]: Recovery failed on {dd1352ff268360d1808ddc0395d50ead}{2adetjU2SM-DBPTDdGFh7w}{RaKibgTtQu2rleRUv9VTRg}{10.212.24.21}{10.212.24.21:9300}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1a, cross_cluster_transport_address=2600:1f18:7c4c:4581:40f7:ba0b:d426:6874, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, di_number=0, search_backpressure_feature_present=true}]; nested: ClusterBlockException[index [transactions_1709083815105] blocked by: [FORBIDDEN/5/index read-only (api)];]; ",
"last_allocation_status" : "no"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [ {
"node_id" : "2adetjU2SM-DBPTDdGFh7w",
"node_name" : "dd1352ff268360d1808ddc0395d50ead",
"node_decision" : "no",
"weight_ranking" : 1,
"deciders" : [ {
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-02-28T01:31:04.433Z], failed_attempts[5], failed_nodes[[2adetjU2SM-DBPTDdGFh7w]], delayed=false, details[failed shard on node [2adetjU2SM-DBPTDdGFh7w]: failed recovery, failure RecoveryFailedException[[transactions_1709083815105][1]: Recovery failed on {dd1352ff268360d1808ddc0395d50ead}{2adetjU2SM-DBPTDdGFh7w}{RaKibgTtQu2rleRUv9VTRg}{10.212.24.21}{10.212.24.21:9300}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1a, cross_cluster_transport_address=2600:1f18:7c4c:4581:40f7:ba0b:d426:6874, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, di_number=0, search_backpressure_feature_present=true}]; nested: ClusterBlockException[index [transactions_1709083815105] blocked by: [FORBIDDEN/5/index read-only (api)];]; ], allocation_status[deciders_no]]]"
} ]
}, {
"node_id" : "keC6H8fGS_6dX0l2HGGeNA",
"node_name" : "b024b85c4243d6cb1b107d2d68005ceb",
"node_decision" : "no",
"weight_ranking" : 2,
"deciders" : [ {
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-02-28T01:31:04.433Z], failed_attempts[5], failed_nodes[[2adetjU2SM-DBPTDdGFh7w]], delayed=false, details[failed shard on node [2adetjU2SM-DBPTDdGFh7w]: failed recovery, failure RecoveryFailedException[[transactions_1709083815105][1]: Recovery failed on {dd1352ff268360d1808ddc0395d50ead}{2adetjU2SM-DBPTDdGFh7w}{RaKibgTtQu2rleRUv9VTRg}{10.212.24.21}{10.212.24.21:9300}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1a, cross_cluster_transport_address=2600:1f18:7c4c:4581:40f7:ba0b:d426:6874, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, di_number=0, search_backpressure_feature_present=true}]; nested: ClusterBlockException[index [transactions_1709083815105] blocked by: [FORBIDDEN/5/index read-only (api)];]; ], allocation_status[deciders_no]]]"
}, {
"decider" : "resize",
"decision" : "NO",
"explanation" : "source primary is allocated on another node"
} ]
}, {
"node_id" : "FJ37FfInRjWJ-BiXWUR8kw",
"node_name" : "4185113e6a8486abe34bf66e71cddd1a",
"node_decision" : "no",
"weight_ranking" : 3,
"deciders" : [ {
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2024-02-28T01:31:04.433Z], failed_attempts[5], failed_nodes[[2adetjU2SM-DBPTDdGFh7w]], delayed=false, details[failed shard on node [2adetjU2SM-DBPTDdGFh7w]: failed recovery, failure RecoveryFailedException[[transactions_1709083815105][1]: Recovery failed on {dd1352ff268360d1808ddc0395d50ead}{2adetjU2SM-DBPTDdGFh7w}{RaKibgTtQu2rleRUv9VTRg}{10.212.24.21}{10.212.24.21:9300}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=true, zone=us-east-1a, cross_cluster_transport_address=2600:1f18:7c4c:4581:40f7:ba0b:d426:6874, awareness_features_enabled=true, global_cpu_usage_ac_supported=true, shard_indexing_pressure_enabled=true, di_number=0, search_backpressure_feature_present=true}]; nested: ClusterBlockException[index [transactions_1709083815105] blocked by: [FORBIDDEN/5/index read-only (api)];]; ], allocation_status[deciders_no]]]"
}, {
"decider" : "resize",
"decision" : "NO",
"explanation" : "source primary is allocated on another node"
} ]
} ]
}
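For reference, the manual retry that the max_retry decider mentions would be:

POST /_cluster/reroute?retry_failed=true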
I am confused as to why the shard allocation is failing with [FORBIDDEN/5/index read-only (api)].
My first hunch was that my read-only toggling was the cause, but since the other 4 shards were allocated successfully, I figure something else is at play. The cluster also has around 800 GB of free space, so I didn't think it was complaining about storage space.
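(For reference, per-node disk headroom can be checked with something like the following; the column selection is just what I find convenient:)

GET /_cat/allocation?v&h=node,disk.avail,disk.used,disk.percent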
In addition, the error is intermittent: I run this job once a day and can reproduce it most of the time, but not every time.