[allocate_stale_primary] primary is already assigned

My ES 5.6.3 cluster is yellow:

administrator@srv4-sv:~$ curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "redacted",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 15,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 172,
  "active_shards" : 350,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.71509971509973
}

administrator@srv4-sv:~$ curl -sS -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
stats_new             1 r UNASSIGNED ALLOCATION_FAILED

administrator@srv4-sv:~$ curl -sS -X POST 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "stats_new",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-08-03T22:38:21.312Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeK
lih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to
 transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [ ... ]
}

According to the Internet, the fix is this:

administrator@srv4-sv:~$ curl -sS -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate_stale_primary":{"index":"stats_new","shard":1,"node":"srv4-sv","accept_data_loss":true}}]}'

but that doesn't work:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[srv6-sv][10.64.2.21:9300][cluster:admin/reroute]"}],"type":"illegal_argument_exception","reason":"[allocate_stale_primary] primary [stats_new][1] is already assigned"},"status":400}

Thoughts?

(bump)

Are all your nodes using exactly the same version of Elasticsearch (use the cat nodes API to check)?
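
For example, something like:

curl -sS 'localhost:9200/_cat/nodes?v&h=name,version'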

Yes, they're all on 5.6.3

@jethroft You need to remove the sync_id, then retry allocation.

  1. Issue a force flush to remove the sync-id: POST /stats_new/_flush?force=true
  2. Issue a cluster reroute to retry allocation: POST /_cluster/reroute?retry_failed
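
In curl form, against the same localhost:9200 endpoint used elsewhere in this thread, that would be roughly (with the retry flag spelled out as retry_failed=true):

curl -sS -XPOST 'localhost:9200/stats_new/_flush?force=true'
curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'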

I hope this helps.


@nhat Thanks. I tried that, but I don't think it did anything:

$ curl -sS -XPOST 'localhost:9200/stats_new/_flush?force=true'

{"_shards":{"total":10,"successful":9,"failed":0}}

$ curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed'

{
  "acknowledged": true,
  "state": {
    "version": 21335,
    "state_uuid": "WZNP_dwGQ1iXfpvKMfZHBg",
    "master_node": "-qnu-AcDSyyuSQZbZ_ykfQ",
    "blocks": {},
    "nodes": { ... },
    "routing_table": {
      "indices": {
        ...
        "stats_new": {
          "shards": {
            ...
            "1": [
              {
                "state": "STARTED",
                "primary": true,
                "node": "w_6a_qIfTN2L8BN0wGiCyQ",
                "relocating_node": null,
                "shard": 1,
                "index": "stats_new",
                "allocation_id": {
                  "id": "ikEbwB1lQPuBCLUeG96VzA"
                }
              },
              {
                "state": "UNASSIGNED",
                "primary": false,
                "node": null,
                "relocating_node": null,
                "shard": 1,
                "index": "stats_new",
                "recovery_source": {
                  "type": "PEER"
                },
                "unassigned_info": {
                  "reason": "ALLOCATION_FAILED",
                  "at": "2018-08-03T22:38:21.312Z",
                  "failed_attempts": 5,
                  "delayed": false,
                  "details": "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeKlih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
                  "allocation_status": "no_attempt"
                }
              }
            ],
            ...
          }
        },
        ...
      }
    },
    "routing_nodes": {
      "unassigned": [
        {
          "state": "UNASSIGNED",
          "primary": false,
          "node": null,
          "relocating_node": null,
          "shard": 1,
          "index": "stats_new",
          "recovery_source": {
            "type": "PEER"
          },
          "unassigned_info": {
            "reason": "ALLOCATION_FAILED",
            "at": "2018-08-03T22:38:21.312Z",
            "failed_attempts": 5,
            "delayed": false,
            "details": "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeKlih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
            "allocation_status": "no_attempt"
          }
        }
      ],
      ...
    },
    "snapshots": {
      "snapshots": []
    }
  }
}

I ran the reroute again with retry_failed=true (the =true was missing before), and now it's fixed. Thanks!
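
For anyone landing here with the same error, the sequence that resolved it was the forced flush followed by

curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

which retries the allocation that had previously hit its failed-attempts limit.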
