[allocate_stale_primary] primary is already assigned

My ES 5.6.3 cluster is yellow:

administrator@srv4-sv:~$ curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "redacted",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 15,
  "number_of_data_nodes" : 15,
  "active_primary_shards" : 172,
  "active_shards" : 350,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.71509971509973
}

administrator@srv4-sv:~$ curl -sS -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
stats_new             1 r UNASSIGNED ALLOCATION_FAILED

administrator@srv4-sv:~$ curl -sS -X POST 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "stats_new",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-08-03T22:38:21.312Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeK
lih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to
 transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [ ... ]
}

According to the Internet, the fix is this:

administrator@srv4-sv:~$ curl -sS -XPOST 'localhost:9200/_cluster/reroute' -d '{"commands":[{"allocate_stale_primary":{"index":"stats_new","shard":1,"node":"srv4-sv","accept_data_loss":true}}]}'

but that doesn't work:

{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[srv6-sv][10.64.2.21:9300][cluster:admin/reroute]"}],"type":"illegal_argument_exception","reason":"[allocate_stale_primary] primary [stats_new][1] is already assigned"},"status":400}

Thoughts?

(bump)

Are all your nodes using exactly the same version of Elasticsearch (use the cat nodes API to check)?
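
For example, something like:

curl -sS 'localhost:9200/_cat/nodes?v&h=name,version'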

Yes, they're all on 5.6.3

@jethroft You need to remove the sync_id, then retry allocation.

  1. Issue a force flush to remove the sync-id: POST /stats_new/_flush?force=true
  2. Issue a cluster reroute to retry allocation: POST /_cluster/reroute?retry_failed
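
In curl form, against the same localhost:9200 endpoint used elsewhere in this thread, that would be roughly (with the retry flag spelled out as retry_failed=true):

curl -sS -XPOST 'localhost:9200/stats_new/_flush?force=true'
curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'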

I hope this helps.


@nhat Thanks. I tried that, but I don't think it did anything:

$ curl -sS -XPOST 'localhost:9200/stats_new/_flush?force=true'

{"_shards":{"total":10,"successful":9,"failed":0}}

$ curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed'

{
  "acknowledged": true,
  "state": {
    "version": 21335,
    "state_uuid": "WZNP_dwGQ1iXfpvKMfZHBg",
    "master_node": "-qnu-AcDSyyuSQZbZ_ykfQ",
    "blocks": {},
    "nodes": { ... },
    "routing_table": {
      "indices": {
        ...
        "stats_new": {
          "shards": {
            ...
            "1": [
              {
                "state": "STARTED",
                "primary": true,
                "node": "w_6a_qIfTN2L8BN0wGiCyQ",
                "relocating_node": null,
                "shard": 1,
                "index": "stats_new",
                "allocation_id": {
                  "id": "ikEbwB1lQPuBCLUeG96VzA"
                }
              },
              {
                "state": "UNASSIGNED",
                "primary": false,
                "node": null,
                "relocating_node": null,
                "shard": 1,
                "index": "stats_new",
                "recovery_source": {
                  "type": "PEER"
                },
                "unassigned_info": {
                  "reason": "ALLOCATION_FAILED",
                  "at": "2018-08-03T22:38:21.312Z",
                  "failed_attempts": 5,
                  "delayed": false,
                  "details": "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeKlih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
                  "allocation_status": "no_attempt"
                }
              }
            ],
            ...
          }
        },
        ...
      }
    },
    "routing_nodes": {
      "unassigned": [
        {
          "state": "UNASSIGNED",
          "primary": false,
          "node": null,
          "relocating_node": null,
          "shard": 1,
          "index": "stats_new",
          "recovery_source": {
            "type": "PEER"
          },
          "unassigned_info": {
            "reason": "ALLOCATION_FAILED",
            "at": "2018-08-03T22:38:21.312Z",
            "failed_attempts": 5,
            "delayed": false,
            "details": "failed recovery, failure RecoveryFailedException[[stats_new][1]: Recovery failed from {srv4-sv}{w_6a_qIfTN2L8BN0wGiCyQ}{0KxMMAQETZywBTVNtriTkA}{10.64.2.17}{10.64.2.17:9300} into {srv4-ch}{7iSlk_IzRleE3AQag6cAfA}{i5S74TwfQeKlih5okbSjvQ}{10.64.3.17}{10.64.3.17:9300}]; nested: RemoteTransportException[[srv4-sv][10.64.2.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [stats_new][1] from primary shard with sync id but number of docs differ: 29983940 (srv4-sv, primary) vs 29983938(srv4-ch)]; ",
            "allocation_status": "no_attempt"
          }
        }
      ],
      ...
    },
    "snapshots": {
      "snapshots": []
    }
  }
}

I ran the reroute again with retry_failed=true (the =true was missing before), and now it's fixed. Thanks!
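
For anyone landing here with the same error, the sequence that resolved it was the forced flush followed by

curl -sS -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

which retries the allocation that had previously hit its failed-attempts limit.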
