Unassigned shard with inconsistent primary state and doc count differences

I have a three-node ES cluster running 5.6.5. Last week I restarted the servers in the cluster to make changes to their disk configuration; I did not follow any particular process to do this, but after 12 hours the cluster and indexes were back in a green state. Yesterday I restarted one server, following the rolling upgrade process but without actually upgrading the server. Indexing (Logstash) is still stopped and no documents are being added to the system.

I ended up in a state with sync issues that appears to be identical to this post. I am also running a similarly antiquated version, a situation I am working to address.

I attempted to reroute the bad index, then to remove and re-add replicas, which resulted in a wealth of failures and left us in the following state (a rough sketch of the calls we made follows the table):

corp_b 0 r STARTED    125766000 131.5gb 162.211.235.20 pcorplog3
corp_b 0 p STARTED    125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED    125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r UNASSIGNED
corp_b 2 p STARTED    125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r STARTED    125765953 130.9gb 162.211.235.10 pcorplog1
corp_b 3 r STARTED    125765360 132.1gb 162.211.235.11 pcorplog2
corp_b 3 p STARTED    125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r STARTED    125752747 133.2gb 162.211.235.11 pcorplog2
corp_b 4 p STARTED    125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED    125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r STARTED    125771560   128gb 162.211.235.11 pcorplog2
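
For reference, the reroute retry and replica changes were roughly along these lines (run with curl against one of the nodes; host and port assumed):

    # retry allocations that have hit the max-retries limit
    curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

    # drop the replicas (re-adding them is the same call with "number_of_replicas": 1)
    curl -XPUT 'localhost:9200/corp_b/_settings' -H 'Content-Type: application/json' -d '
    {
      "index": { "number_of_replicas": 0 }
    }'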

This report, from _cat/shards, indicates that the primary of shard 1 is located on pcorplog2 and its replica is unassigned. I see the same information when querying _cat/shards on all three cluster members.

However, when I look at _cluster/allocation/explain?pretty I seem to get a different view as to which node holds the primary copy of this shard; I receive failure reports from the three nodes as follows:

pcorplog1: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog2: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog3: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)

I'm very eager to get this issue resolved; thanks in advance for any assistance you can offer. The full allocation explain output follows in a second post.
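
For reference, the explain output below came from a request along these lines (shard and primary chosen to match the unassigned copy; host and port assumed):

    curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
    {
      "index": "corp_b",
      "shard": 1,
      "primary": false
    }'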

{
  "index" : "corp_b",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-10-15T19:57:35.647Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "5XIoN5ESQBeUL54_RK5KQQ",
      "node_name" : "pcorplog3",
      "transport_address" : "162.211.235.20:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "m7D5fhFhR0W5jzqZP5GExA",
      "node_name" : "pcorplog1",
      "transport_address" : "162.211.235.10:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "vZxO4e79QjC3TQzOTapg1Q",
      "node_name" : "pcorplog2",
      "transport_address" : "162.211.235.11:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[corp_b][1], node[vZxO4e79QjC3TQzOTapg1Q], [P], s[STARTED], a[id=J7YE4wsKRgK8OQEwzKJDYA]]"
        }
      ]
    }
  ]
}

How soon after removing the replicas did you add them back again?

Initially the replicas of shards 3 and 4 were moving between nodes 2 and 3, fluctuating between unassigned and initializing states. We attempted to cancel and move them using the _cluster/reroute API. When this had no apparent impact we set the number of replicas to 0, which resolved the issues for shards 3 and 4, and the cluster then attempted to relocate shard 1 to a different node to balance the cluster. We waited for about two hours and the situation wasn't improving, so we set the number of replicas back to 1. This resulted in the following state, with many errors over the following hours (the cancel calls are sketched after the table).

corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125871592 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p RELOCATING   125852748 128.6gb 162.211.235.20 pcorplog3 -> 162.211.235.11 vZxO4e79QjC3TQzOTapg1Q pcorplog2
corp_b 1 r UNASSIGNED
corp_b 2 p STARTED      125871918 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r INITIALIZING                   162.211.235.10 pcorplog1
corp_b 3 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 3 p STARTED      125870697 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125858400 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125877705   128gb 162.211.235.20 pcorplog3
corp_b 5 r INITIALIZING                   162.211.235.11 pcorplog2
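
The cancel attempts mentioned above were roughly of this form (shard 3's replica on pcorplog2 shown as an example; the exact shards and nodes varied):

    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
    {
      "commands": [
        { "cancel": { "index": "corp_b", "shard": 3, "node": "pcorplog2", "allow_primary": false } }
      ]
    }'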

This morning we found ourselves in the state you see described in my post above.

EDIT: We also didn't see any data transfer occurring until we re-added the replicas, and even then the transfer didn't start immediately; there was a significant delay before it began.

We have a few gigs of elasticsearch.log files, if there are specific strings we can search for to clarify the state we'll be happy to provide them.

Thanks very much for your help!

Ok, interesting. I suspect the ongoing rebalance attempts may have prevented the out-of-sync copy from being cleaned up properly, because the cleanup only happens if the shard is fully assigned and none of the copies are relocating. I would suggest disabling rebalancing and then setting the number of replicas back down to 0 to avoid the relocation.

Ok, great, I'll try that shortly and report back on the resulting state so you know what's going on.

Just to clarify, by disabling rebalancing do you mean disabling shard allocation using the following setting, or is there another setting we should adjust?

    "cluster.routing.allocation.enable": "none"

Thanks again!

Yes that's right.

Agh, sorry, I misread the vital word! The correct setting is "cluster.routing.rebalance.enable": "none"
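
Applied transiently via the cluster settings API, that would be something like (host and port as appropriate for your cluster):

    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": { "cluster.routing.rebalance.enable": "none" }
    }'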

Ok, it's looking like that worked!!!

corp_b 0 p STARTED 125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED 125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 2 p STARTED 125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 3 p STARTED 125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 p STARTED 125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED 125771560   128gb 162.211.235.20 pcorplog3

I should be able to bring the replica count back up to 1 now and have ES add the replicas?
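
I'm assuming that's just a matter of something like the following against the index settings:

    curl -XPUT 'localhost:9200/corp_b/_settings' -H 'Content-Type: application/json' -d '
    {
      "index": { "number_of_replicas": 1 }
    }'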

Yes, let's have those replicas back. I expect them to allocate cleanly but it would be good to hear either way.

I've increased the replicas back to 1 and the shards appear to be initializing; we're following the status closely and will update the thread later today.

corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED      125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r INITIALIZING                   162.211.235.10 pcorplog1
corp_b 2 p STARTED      125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 3 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 3 p STARTED      125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r INITIALIZING                   162.211.235.10 pcorplog1

Everything is looking pretty good and I can see data being transferred between the nodes on the network, but I don't really know how much time is left.

Is it possible to get a view of the shard replication progress and percentage complete? I can see some details in /_cluster/health/corp_b?level=shards but it doesn't give me any indication of the transfer progress.

Thanks again!

Yes, the index recovery API is the place to look for more details.
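
For example, something along these lines will give you per-shard progress (host assumed; the jq filter is just one way to slice it):

    curl -s 'localhost:9200/corp_b/_recovery' | jq '.corp_b.shards[] | {id, stage, bytes: .index.size.percent, files: .index.files.percent}'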

The recovery is proceeding well for some shards, but I have some concerns about shard 0.

Here's the current _cat/shards status:

corp_b 2 p STARTED      125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r STARTED      125765953 130.9gb 162.211.235.11 pcorplog2
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r STARTED      125771560   128gb 162.211.235.10 pcorplog1
corp_b 3 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 3 p STARTED      125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 1 p STARTED      125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r STARTED      125746770 128.6gb 162.211.235.10 pcorplog1
corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125766000 131.5gb 162.211.235.11 pcorplog2

When we break out the index size and files information from the corp_b/_recovery status we see the following:

johnf@DESKTOP-KH3IF90:~/Documents/src/corp-central-monitoring/script/es$ ./get-index-recovery-status.sh corp_b | jq '.corp_b.shards[] | select(.id==0) | [.id, .index.size, .index.files]'
[
  0,
  {
    "total_in_bytes": 141264662420,
    "reused_in_bytes": 0,
    "recovered_in_bytes": 134678320808,
    "percent": "95.3%"
  },
  {
    "total": 194,
    "reused": 0,
    "recovered": 192,
    "percent": "99.0%"
  }
]
[
  0,
  {
    "total_in_bytes": 0,
    "reused_in_bytes": 0,
    "recovered_in_bytes": 0,
    "percent": "0.0%"
  },
  {
    "total": 0,
    "reused": 0,
    "recovered": 0,
    "percent": "0.0%"
  }
]

The recovery percentage is increasing but I'm still concerned by the state seen here.

I'll update again tomorrow morning unless there are further questions; if you want more information from the recovery status, please let me know. Really appreciate your assistance!

I've got the cluster back online and everything is operating correctly; however, I do have some questions about the recovery information shown in the index section of the recovery document. Some shards show files and byte values that are close to each other on the two copies of the shard, but others are in a state like the one below, where one copy reports a size of 0 bytes and 0 files. Here's the output from /_recovery:

      {
        "id": 0,
        [...]
        "index": {
          "size": {
            "total_in_bytes": 141264662420,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 71429720776,
            "percent": "50.6%"
          },
          "files": {
            "total": 194,
            "reused": 0,
            "recovered": 181,
            "percent": "93.3%"
          },
        [...]
      },
      {
        "id": 0,
        [...]
        "index": {
          "size": {
            "total_in_bytes": 0,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 0,
            "percent": "0.0%"
          },
          "files": {
            "total": 0,
            "reused": 0,
            "recovered": 0,
            "percent": "0.0%"
          },
        [...]
      },

In _cat/shards, however, both copies show a normal size:

corp_b 0 p STARTED 125838256 132.5gb 162.211.235.11 pcorplog2
corp_b 0 r STARTED 125838280 131.8gb 162.211.235.20 pcorplog3

Is this normal? How should we interpret this information?

Two situations in which I would expect to see that are a peer recovery to a synced-flushed peer, or a recovery that hasn't really started yet. I don't have an easy way to look at the code for such an old version, sorry, so I can't tell you whether there's another path to that state.

We did perform a synced flush before we rebooted the node, as per the rolling restart / upgrade guide we followed; that may be part of how we ended up here.
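
For reference, that step was the synced flush from the guide, roughly (host assumed):

    curl -XPOST 'localhost:9200/_flush/synced'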

Regardless, we are back up, operational, and indexing data. Thank you very much for all your help @DavidTurner, really appreciate it!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.