Help! After upgrading the cluster to Elasticsearch 6, shard replicas will not allocate

I upgraded my cluster to Elasticsearch 6 and, while it is working, none of the replica shards are being allocated :frowning:

When I use the allocation explain API, I get this:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "px-ext-access-2016.09.01",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "REPLICA_ADDED",
    "at" : "2017-11-17T07:02:48.361Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8SjCHI-OQn-mBAl0OBLx6Q",
      "node_name" : "bos1-es2",
      "transport_address" : "192.168.8.151:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        }
      ]
    },
    {
      "node_id" : "JR9qqMjEQF6c22eGZqIAcw",
      "node_name" : "bos1-es3",
      "transport_address" : "192.168.8.152:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[px-ext-access-2016.09.01][4], node[JR9qqMjEQF6c22eGZqIAcw], [P], s[STARTED], a[id=IO9Hx2mpScqfigUPaX7mkQ]]"
        }
      ]
    },
    {
      "node_id" : "tBZdi1W_TRa16getb8eNlA",
      "node_name" : "bos1-es1",
      "transport_address" : "192.168.8.150:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        }
      ]
    }
  ]
}

Help, what do I do? I checked and all the cluster members are running 6.0.0, so I don't think there is a version mismatch issue.
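For reference, the version check was something like this (assuming the default local endpoint):

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version'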

Did you disable allocation when you started the upgrade? If so, did you re-enable it?

I figured it out just a minute ago. I had disabled allocation and re-enabled it, but that didn't do anything. I tried "disabling" it and then re-enabling it again, and the shards started allocating correctly. Hurray for turning it off and on again :smiley:
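For anyone hitting the same thing, the off-and-on cycle was roughly this (a sketch using the transient cluster setting; setting it back to null restores the default of "all"):

# disable shard allocation
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "none" } }'

# re-enable shard allocation by clearing the setting
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": null } }'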

2 Likes

Uh oh, it looks like there are a few shards that still aren't getting allocated. I get this error:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "px-web-server-2017.11.17",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2017-11-17T07:27:37.708Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][4]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HS9TBPU9Vhf7shs_U', type='px-web-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8SjCHI-OQn-mBAl0OBLx6Q",
      "node_name" : "bos1-es2",
      "transport_address" : "192.168.8.151:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "matching_size_in_bytes" : 0
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2017-11-17T07:27:37.708Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][4]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HS9TBPU9Vhf7shs_U', type='px-web-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ], allocation_status[no_attempt]]]"
        }
      ]
    },

I tried running the retry command that the error suggests, but the replica still won't allocate.
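The retry call from the error message is roughly this (assuming the default local endpoint):

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'

Afterwards the routing table for the index still shows the replica unassigned (failed_attempts is now 6):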

"px-web-server-2017.11.17" : {
          "shards" : {
            "2" : [
              {
                "state" : "STARTED",
                "primary" : true,
                "node" : "JR9qqMjEQF6c22eGZqIAcw",
                "relocating_node" : null,
                "shard" : 2,
                "index" : "px-web-server-2017.11.17",
                "allocation_id" : {
                  "id" : "9X4ESJ3CS1OOnE3Y6V-KPA"
                }
              },
              {
                "state" : "UNASSIGNED",
                "primary" : false,
                "node" : null,
                "relocating_node" : null,
                "shard" : 2,
                "index" : "px-web-server-2017.11.17",
                "recovery_source" : {
                  "type" : "PEER"
                },
                "unassigned_info" : {
                  "reason" : "ALLOCATION_FAILED",
                  "at" : "2017-11-17T16:15:11.973Z",
                  "failed_attempts" : 6,
                  "delayed" : false,
                  "details" : "failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][2]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.\
168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.ena\
bled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: Remot\
eTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HSw2kXITtF5uOUXel', type='px-web\
-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ",
                  "allocation_status" : "no_attempt"
                }
              }
            ],

I think I'm just going to call it quits on this issue. It's only one index, and it was being created during the rolling restart of the cluster, so I bet it's in some screwy state. I'll just reindex it and not worry about it anymore :wink:
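(For completeness, the reindex would be something like this; the destination name "px-web-server-2017.11.17-restored" is just a placeholder I made up:)

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '{ "source": { "index": "px-web-server-2017.11.17" }, "dest": { "index": "px-web-server-2017.11.17-restored" } }'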

If it's a replica, just set the replica count to 0 on the index, then add it back.
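Something along these lines, assuming the index above and that you want one replica back afterwards:

# drop the replica
curl -XPUT 'localhost:9200/px-web-server-2017.11.17/_settings' -H 'Content-Type: application/json' -d '{ "index": { "number_of_replicas": 0 } }'

# add it back
curl -XPUT 'localhost:9200/px-web-server-2017.11.17/_settings' -H 'Content-Type: application/json' -d '{ "index": { "number_of_replicas": 1 } }'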

1 Like

Hey that worked! Thanks Mark!

@Warren_Turner can you please confirm that you did a rolling upgrade?

Please see https://github.com/elastic/elasticsearch/issues/27536 for a description of why we think this happens.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.