Help! After upgrading the cluster to Elasticsearch 6, shard replicas will not allocate

I upgraded my cluster to Elasticsearch 6 and, while it is working, none of the replica shards are being allocated :frowning:

When I use the allocation explain API, I get this:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "px-ext-access-2016.09.01",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "REPLICA_ADDED",
    "at" : "2017-11-17T07:02:48.361Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8SjCHI-OQn-mBAl0OBLx6Q",
      "node_name" : "bos1-es2",
      "transport_address" : "192.168.8.151:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        }
      ]
    },
    {
      "node_id" : "JR9qqMjEQF6c22eGZqIAcw",
      "node_name" : "bos1-es3",
      "transport_address" : "192.168.8.152:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[px-ext-access-2016.09.01][4], node[JR9qqMjEQF6c22eGZqIAcw], [P], s[STARTED], a[id=IO9Hx2mpScqfigUPaX7mkQ]]"
        }
      ]
    },
    {
      "node_id" : "tBZdi1W_TRa16getb8eNlA",
      "node_name" : "bos1-es1",
      "transport_address" : "192.168.8.150:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "enable",
          "decision" : "NO",
          "explanation" : "no allocations are allowed due to {}"
        }
      ]
    }
  ]
}

Help, what do I do? I checked and all the cluster members are running 6.0.0, so I don't think there is a version mismatch issue.
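For reference, the version check was something like this (assuming the default local endpoint):

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,version'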

Did you disable allocation when you started the upgrade? If so, did you re-enable it?

I figured it out just a minute ago. I had disabled allocation and re-enabled it, but that didn't do anything. I tried "disabling" it and then re-enabling it again, and the shards started allocating correctly. Hurray for turning it off and on again :smiley:
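For anyone hitting the same thing, the off-and-on cycle was roughly this (a sketch using the transient cluster setting; setting it back to null restores the default of "all"):

# disable shard allocation
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "none" } }'

# re-enable shard allocation by clearing the setting
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": null } }'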

2 Likes

Uh oh, it looks like there are a few shards that still aren't getting allocated. I get this error:

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "px-web-server-2017.11.17",
  "shard" : 4,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2017-11-17T07:27:37.708Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][4]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HS9TBPU9Vhf7shs_U', type='px-web-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8SjCHI-OQn-mBAl0OBLx6Q",
      "node_name" : "bos1-es2",
      "transport_address" : "192.168.8.151:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "matching_size_in_bytes" : 0
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2017-11-17T07:27:37.708Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][4]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: RemoteTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HS9TBPU9Vhf7shs_U', type='px-web-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ], allocation_status[no_attempt]]]"
        }
      ]
    },

I tried running the retry command that the error suggests, but the replica still won't allocate.
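The retry call from the error message is roughly this (assuming the default local endpoint):

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'

Afterwards the routing table for the index still shows the replica unassigned (failed_attempts is now 6):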

"px-web-server-2017.11.17" : {
          "shards" : {
            "2" : [
              {
                "state" : "STARTED",
                "primary" : true,
                "node" : "JR9qqMjEQF6c22eGZqIAcw",
                "relocating_node" : null,
                "shard" : 2,
                "index" : "px-web-server-2017.11.17",
                "allocation_id" : {
                  "id" : "9X4ESJ3CS1OOnE3Y6V-KPA"
                }
              },
              {
                "state" : "UNASSIGNED",
                "primary" : false,
                "node" : null,
                "relocating_node" : null,
                "shard" : 2,
                "index" : "px-web-server-2017.11.17",
                "recovery_source" : {
                  "type" : "PEER"
                },
                "unassigned_info" : {
                  "reason" : "ALLOCATION_FAILED",
                  "at" : "2017-11-17T16:15:11.973Z",
                  "failed_attempts" : 6,
                  "delayed" : false,
                  "details" : "failed recovery, failure RecoveryFailedException[[px-web-server-2017.11.17][2]: Recovery failed from {bos1-es3}{JR9qqMjEQF6c22eGZqIAcw}{izoNnZpWTsyt5_xxymVDbg}{192.\
168.8.152}{192.168.8.152:9300}{ml.max_open_jobs=10, ml.enabled=true} into {bos1-es1}{tBZdi1W_TRa16getb8eNlA}{picaUSZyS7K34rTVpJvu1Q}{192.168.8.150}{192.168.8.150:9300}{ml.max_open_jobs=10, ml.ena\
bled=true}]; nested: RemoteTransportException[[bos1-es3][192.168.8.152:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[2] phase2 failed]; nested: Remot\
eTransportException[[bos1-es1][192.168.8.150:9300][internal:index/shard/recovery/translog_ops]]; nested: TranslogException[Failed to write operation [Index{id='AV_HSw2kXITtF5uOUXel', type='px-web\
-server', seqNo=-2, primaryTerm=0}]]; nested: IllegalArgumentException[sequence number must be assigned]; ",
                  "allocation_status" : "no_attempt"
                }
              }
            ],

I think I'm just going to call it quits on this issue. It's only one index, and it was being created during the rolling restart of the cluster, so I bet it's in some screwy state. I'll just reindex it and not worry about it anymore :wink:
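(For completeness, the reindex would be something like this; the destination name "px-web-server-2017.11.17-restored" is just a placeholder I made up:)

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d '{ "source": { "index": "px-web-server-2017.11.17" }, "dest": { "index": "px-web-server-2017.11.17-restored" } }'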

If it's a replica, just set the replica count to 0 on the index, then add it back.
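Something along these lines, assuming the index above and that you want one replica back afterwards:

# drop the replica
curl -XPUT 'localhost:9200/px-web-server-2017.11.17/_settings' -H 'Content-Type: application/json' -d '{ "index": { "number_of_replicas": 0 } }'

# add it back
curl -XPUT 'localhost:9200/px-web-server-2017.11.17/_settings' -H 'Content-Type: application/json' -d '{ "index": { "number_of_replicas": 1 } }'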

1 Like

Hey that worked! Thanks Mark!

@Warren_Turner can you please confirm that you did a rolling upgrade?

Please see https://github.com/elastic/elasticsearch/issues/27536 for a description of why we think this happens.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.