Unassigned shard with inconsistent primary state and doc count differences

I have a three-node ES cluster running 5.6.5. Last week I restarted the servers in the cluster to make changes to their disk configuration; I did not follow any particular process to do this, but after 12 hours the cluster and indexes were back in a green state. Yesterday I restarted one server, following the rolling upgrade process but without actually upgrading the server. Indexing (Logstash) is still stopped and no documents are being added to the system.

I ended up in a state with sync issues that appears to be identical to this post. I am also running a similarly antiquated version, a situation I am working to address.

I attempted to reroute the bad index, then to remove and re-add replicas, which resulted in a wealth of failures and left us in the following state (a rough sketch of the calls we made follows the table):

corp_b 0 r STARTED    125766000 131.5gb 162.211.235.20 pcorplog3
corp_b 0 p STARTED    125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED    125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r UNASSIGNED
corp_b 2 p STARTED    125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r STARTED    125765953 130.9gb 162.211.235.10 pcorplog1
corp_b 3 r STARTED    125765360 132.1gb 162.211.235.11 pcorplog2
corp_b 3 p STARTED    125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r STARTED    125752747 133.2gb 162.211.235.11 pcorplog2
corp_b 4 p STARTED    125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED    125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r STARTED    125771560   128gb 162.211.235.11 pcorplog2
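
For reference, the reroute retry and replica changes were roughly along these lines (run with curl against one of the nodes; host and port assumed):

    # retry allocations that have hit the max-retries limit
    curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'

    # drop the replicas (re-adding them is the same call with "number_of_replicas": 1)
    curl -XPUT 'localhost:9200/corp_b/_settings' -H 'Content-Type: application/json' -d '
    {
      "index": { "number_of_replicas": 0 }
    }'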

This report, from _cat/shards, indicates that the primary of shard 1 is located on pcorplog2 and its replica is unassigned. I see the same information when querying _cat/shards on all three cluster members.

However, when I look at _cluster/allocation/explain?pretty I seem to get a different view as to which node holds the primary copy of this shard; I receive failure reports from the three nodes as follows:

pcorplog1: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog2: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)
pcorplog3: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)

I'm very eager to get this issue resolved; thanks in advance for any assistance you can offer. The full allocation explain output follows in a second post.
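
For reference, the explain output below came from a request along these lines (shard and primary chosen to match the unassigned copy; host and port assumed):

    curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
    {
      "index": "corp_b",
      "shard": 1,
      "primary": false
    }'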

{
  "index" : "corp_b",
  "shard" : 1,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-10-15T19:57:35.647Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "5XIoN5ESQBeUL54_RK5KQQ",
      "node_name" : "pcorplog3",
      "transport_address" : "162.211.235.20:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "m7D5fhFhR0W5jzqZP5GExA",
      "node_name" : "pcorplog1",
      "transport_address" : "162.211.235.10:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "vZxO4e79QjC3TQzOTapg1Q",
      "node_name" : "pcorplog2",
      "transport_address" : "162.211.235.11:9300",
      "node_attributes" : {
        "ml.max_open_jobs" : "10",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T19:57:35.647Z], failed_attempts[5], delayed=false, details[failed recovery, failure RecoveryFailedException[[corp_b][1]: Recovery failed from {pcorplog3}{5XIoN5ESQBeUL54_RK5KQQ}{Z865bkClQUu1v0gpEBJsdQ}{162.211.235.20}{162.211.235.20:9300}{ml.max_open_jobs=10, ml.enabled=true} into {pcorplog1}{m7D5fhFhR0W5jzqZP5GExA}{xiwBLQAzT3qiziiTxVgNuA}{162.211.235.10}{162.211.235.10:9300}{ml.max_open_jobs=10, ml.enabled=true}]; nested: RemoteTransportException[[pcorplog3][162.211.235.20:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [corp_b][1] from primary shard with sync id but number of docs differ: 125852748 (pcorplog3, primary) vs 125852744(pcorplog1)]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[corp_b][1], node[vZxO4e79QjC3TQzOTapg1Q], [P], s[STARTED], a[id=J7YE4wsKRgK8OQEwzKJDYA]]"
        }
      ]
    }
  ]
}

How soon after removing the replicas did you add them back again?

Initially the replicas of shards 3 and 4 were moving between nodes 2 and 3, fluctuating between unassigned and initializing states. We attempted to cancel and move them using the _cluster/reroute API. When this had no apparent impact we set the number of replicas to 0, which resolved the issues for shards 3 and 4, and the cluster then attempted to relocate shard 1 to a different node to balance the cluster. We waited for about two hours and the situation wasn't improving, so we set the number of replicas back to 1. This resulted in the following state, with many errors over the following hours (the cancel calls are sketched after the table).

corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125871592 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p RELOCATING   125852748 128.6gb 162.211.235.20 pcorplog3 -> 162.211.235.11 vZxO4e79QjC3TQzOTapg1Q pcorplog2
corp_b 1 r UNASSIGNED
corp_b 2 p STARTED      125871918 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r INITIALIZING                   162.211.235.10 pcorplog1
corp_b 3 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 3 p STARTED      125870697 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125858400 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125877705   128gb 162.211.235.20 pcorplog3
corp_b 5 r INITIALIZING                   162.211.235.11 pcorplog2
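
The cancel attempts mentioned above were roughly of this form (shard 3's replica on pcorplog2 shown as an example; the exact shards and nodes varied):

    curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
    {
      "commands": [
        { "cancel": { "index": "corp_b", "shard": 3, "node": "pcorplog2", "allow_primary": false } }
      ]
    }'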

This morning we found ourselves in the state you see described in my post above.

EDIT: We also didn't see any data transfer occurring until we re-added the replicas, and even then the transfer didn't start immediately; there was a significant delay before it began.

We have a few gigs of elasticsearch.log files, if there are specific strings we can search for to clarify the state we'll be happy to provide them.

Thanks very much for your help!

Ok, interesting. I suspect the ongoing rebalance attempts may have prevented the out-of-sync copy from being cleaned up properly, because the cleanup only happens if the shard is fully assigned and none of the copies are relocating. I would suggest disabling rebalancing and then setting the number of replicas back down to 0 to avoid the relocation.

Ok, great, I'll try that shortly and report back on the resulting state so you know what's going on.

Just to clarify, by disabling rebalancing do you mean disabling shard allocation using the following setting, or is there another setting we should adjust?

    "cluster.routing.allocation.enable": "none"

Thanks again!

Yes that's right.

Agh, sorry, I misread the vital word! The correct setting is "cluster.routing.rebalance.enable": "none"
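
Applied transiently via the cluster settings API, that would be something like (host and port as appropriate for your cluster):

    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": { "cluster.routing.rebalance.enable": "none" }
    }'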

Ok, it's looking like that worked!!!

corp_b 0 p STARTED 125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED 125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 2 p STARTED 125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 3 p STARTED 125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 p STARTED 125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED 125771560   128gb 162.211.235.20 pcorplog3

I should be able to bring the replica count back up to 1 now and have ES add the replicas?
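
I'm assuming that's just a matter of something like the following against the index settings:

    curl -XPUT 'localhost:9200/corp_b/_settings' -H 'Content-Type: application/json' -d '
    {
      "index": { "number_of_replicas": 1 }
    }'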

Yes, let's have those replicas back. I expect them to allocate cleanly but it would be good to hear either way.

I've increased the replicas back to 1 and the shards appear to be initializing; we're following the status closely and will update the thread later today.

corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125766000 131.5gb 162.211.235.11 pcorplog2
corp_b 1 p STARTED      125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r INITIALIZING                   162.211.235.10 pcorplog1
corp_b 2 p STARTED      125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 3 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 3 p STARTED      125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r INITIALIZING                   162.211.235.10 pcorplog1

Everything is looking pretty good and I can see data being transferred between the nodes on the network, but I don't really know how much time is left.

Is it possible to get a view of the shard replication progress and percentage complete? I can see some details in /_cluster/health/corp_b?level=shards but it doesn't give me any indication of the transfer progress.

Thanks again!

Yes, the index recovery API is the place to look for more details.
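
For example, something along these lines will give you per-shard progress (host assumed; the jq filter is just one way to slice it):

    curl -s 'localhost:9200/corp_b/_recovery' | jq '.corp_b.shards[] | {id, stage, bytes: .index.size.percent, files: .index.files.percent}'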

The recovery is proceeding well for some shards, but I have some concerns about shard 0.

Here's the current _cat/shards status:

corp_b 2 p STARTED      125765953 130.9gb 162.211.235.20 pcorplog3
corp_b 2 r STARTED      125765953 130.9gb 162.211.235.11 pcorplog2
corp_b 4 r INITIALIZING                   162.211.235.11 pcorplog2
corp_b 4 p STARTED      125752747 133.2gb 162.211.235.10 pcorplog1
corp_b 5 p STARTED      125771560   128gb 162.211.235.20 pcorplog3
corp_b 5 r STARTED      125771560   128gb 162.211.235.10 pcorplog1
corp_b 3 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 3 p STARTED      125765360 132.1gb 162.211.235.10 pcorplog1
corp_b 1 p STARTED      125746770 128.6gb 162.211.235.11 pcorplog2
corp_b 1 r STARTED      125746770 128.6gb 162.211.235.10 pcorplog1
corp_b 0 r INITIALIZING                   162.211.235.20 pcorplog3
corp_b 0 p STARTED      125766000 131.5gb 162.211.235.11 pcorplog2

When we break out the index size and files information from the corp_b/_recovery status we see the following:

johnf@DESKTOP-KH3IF90:~/Documents/src/corp-central-monitoring/script/es$ ./get-index-recovery-status.sh corp_b | jq '.corp_b.shards[] | select(.id==0) | [.id, .index.size, .index.files]'
[
  0,
  {
    "total_in_bytes": 141264662420,
    "reused_in_bytes": 0,
    "recovered_in_bytes": 134678320808,
    "percent": "95.3%"
  },
  {
    "total": 194,
    "reused": 0,
    "recovered": 192,
    "percent": "99.0%"
  }
]
[
  0,
  {
    "total_in_bytes": 0,
    "reused_in_bytes": 0,
    "recovered_in_bytes": 0,
    "percent": "0.0%"
  },
  {
    "total": 0,
    "reused": 0,
    "recovered": 0,
    "percent": "0.0%"
  }
]

The recovery percentage is increasing but I'm still concerned by the state seen here.

I'll update again tomorrow morning unless there are further questions; if you want more information from the recovery status, please let me know. Really appreciate your assistance!

I've got the cluster back online and everything is operating correctly; however, I do have some questions about the recovery information shown in the index section of the recovery document. Some shards show files and byte values that are close to each other on the two copies of the shard, but others are in a state like the one below, where one copy reports a size of 0 bytes and 0 files. Here's the output from /_recovery:

      {
        "id": 0,
        [...]
        "index": {
          "size": {
            "total_in_bytes": 141264662420,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 71429720776,
            "percent": "50.6%"
          },
          "files": {
            "total": 194,
            "reused": 0,
            "recovered": 181,
            "percent": "93.3%"
          },
        [...]
      },
      {
        "id": 0,
        [...]
        "index": {
          "size": {
            "total_in_bytes": 0,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 0,
            "percent": "0.0%"
          },
          "files": {
            "total": 0,
            "reused": 0,
            "recovered": 0,
            "percent": "0.0%"
          },
        [...]
      },

In _cat/shards, however, both copies show a normal size:

corp_b 0 p STARTED 125838256 132.5gb 162.211.235.11 pcorplog2
corp_b 0 r STARTED 125838280 131.8gb 162.211.235.20 pcorplog3

Is this normal? How should we interpret this information?

Two situations in which I would expect to see that are a peer recovery to a synced-flushed peer, or a recovery that hasn't really started yet. I don't have an easy way to look at the code for such an old version, sorry, so I can't tell you whether there's another path to that state.

We did perform a synced flush before we rebooted the node, as per the rolling restart / upgrade guide we followed; that may be part of how we ended up here.
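
For reference, that step was the synced flush from the guide, roughly (host assumed):

    curl -XPOST 'localhost:9200/_flush/synced'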

Regardless, we are back up, operational, and indexing data. Thank you very much for all your help @DavidTurner, really appreciate it!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.