Replica shard is in unallocated state after upgrade from 5.6.0 to 6.0

I performed a rolling upgrade of my ES cluster from 5.6.0 to 6.0.0. The upgrade went smoothly and all of my nodes were upgraded.

Post upgrade, I noticed that a single replica shard of one index was unallocated with an error, so I used the reroute API with retry_failed to try to allocate it. The allocation failed again and the node logged the following error:

[2017-11-20T16:42:31,679][WARN ][o.e.i.c.IndicesClusterStateService] [esnode2] [[http-2017.11.20][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [http-2017.11.20][1]: Recovery failed from {esnode1}{sLciS6igSYCduQZ65pa8YQ}{mecNh3TBToyO7VDKgNEetQ}{10.44.0.46}{10.44.0.46:9201} into {esnode2}{N55B5iQBQGy6B7YqpudtRw}{uv8NvPcOT0WVPwnsdzwe5w}{10.44.0.47}{10.44.0.47:9201}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:282) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:75) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:617) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.0.jar:6.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode1][10.44.0.46:9201][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[2] phase2 failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:194) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1540) ~[elasticsearch-6.0.0.jar:6.0.0]
        ... 5 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [esnode2][10.44.0.47:9201][internal:index/shard/recovery/translog_ops]
Caused by: org.elasticsearch.index.translog.TranslogException: Failed to write operation [Index{id='AV_Y6q4h5gnndJiKtcQn', type='bro', seqNo=-2, primaryTerm=0}]
        at org.elasticsearch.index.translog.Translog.add(Translog.java:520) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:708) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:727) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:696) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1214) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:395) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:442) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:433) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1540) ~[elasticsearch-6.0.0.jar:6.0.0]
        ... 5 more
Caused by: java.lang.IllegalArgumentException: sequence number must be assigned
        at org.elasticsearch.index.seqno.SequenceNumbers.min(SequenceNumbers.java:90) ~[elasticsearch-6.0.0.jar:6.0.0]
        at org.elasticsearch.index.translog.TranslogWriter.add(TranslogWriter.java:202) ~[elasticsearch-6.0.0.jar:6.0.0]
        ... some more traceback

Apparently there is some problem with the sequence number. As this was only a replica shard, I removed and re-added replicas to work around the problem. However, does anyone have any idea why this error came up?
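
For reference, the retry mentioned above was just the standard reroute call with retry_failed, something along these lines (host and port adjusted to your setup):

curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"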


We are facing the same problem, but removing and re-adding replicas did not solve it. Reducing the number of replicas to 1 or 0 gets the cluster back to green; setting the number of replicas back to 2 leads to the same error again.

UPDATE:
This is how we recovered the index with the unassigned replica shards (the equivalent API calls are sketched after the list):

  1. Set num. replicas to 0
  2. Set index read-only
  3. synced flush on index
  4. Set index back to writable
  5. Set num. replicas back to 2
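
In terms of API calls, that sequence was roughly the following (the index name is a placeholder; adjust host, port and credentials to your setup):

curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '{"index": {"blocks": {"read_only": true}}}'
curl -X POST localhost:9200/<index_name>/_flush/synced
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '{"index": {"blocks": {"read_only": false}}}'
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 2}}'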

This looks to be a bug. What fixed it in your case was the flush. I'll look more into this.

I'm seeing the same thing on my newly upgraded ES6 cluster. At first I thought this was related to a few "custom" indexes, but the affected indexes seem to be scattered all over the place. I've tried performing the following:

  1. Set replicas to zero
  2. Set index to read-only
  3. Perform sync flush against the index
  4. Set index to read/write
  5. Set replicas to 1.

The weird thing is that while I was doing this and waiting for one index to recover, another index started showing the same error in the _cluster/allocation/explain API. It feels like a "global" issue to me.
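
For anyone trying to track these down, the reason a shard is unassigned can be pulled with the allocation explain API, e.g.:

curl -X GET "localhost:9200/_cluster/allocation/explain?pretty"

Without a request body this explains the first unassigned shard it finds; a body with "index", "shard" and "primary" can be supplied to target a specific one.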

Just a follow-up on this: 12 hours after the upgrade, everything seems fine. I ran the steps above on all indexes with unallocated shards (we had 4 with this symptom) and everything worked out. I used the following PowerShell script to fix them (note that the script will change your replica settings, so run with caution):

# Load the cluster URL and credentials from env.json
$EnvObj = Get-Content ./env.json | ConvertFrom-Json
$ClusterUrl = $EnvObj.url
$UserName = $EnvObj.user
$Password = $EnvObj.password

# Build a credential object for the REST calls
$Cred = [pscredential]::new($UserName, ($Password | ConvertTo-SecureString -AsPlainText -Force))

$IndexName = ".monitoring-es-6-2017.11.25" # the index you want to fix

$ReduceNumberOfReplicas = @"
{
    "index" : {
        "number_of_replicas" : 0
    }
}
"@
$RestoreNumberOfReplicas = @"
{
    "index" : {
        "number_of_replicas" : 1
    }
}
"@
$ReadOnlyOn = @"
{
    "index" : {
        "blocks" : {
            "read_only": true
        }
    }
}
"@
$ReadOnlyOff = @"
{
    "index" : {
        "blocks" : {
            "read_only": false
        }
    }
}
"@


# 1. Drop the number of replicas to 0
Invoke-RestMethod -Method Put -Body $ReduceNumberOfReplicas -Uri "$ClusterUrl/$IndexName/_settings" -Credential $Cred -ContentType "application/json"
# 2. Mark the index read-only
Invoke-RestMethod -Method Put -Body $ReadOnlyOn -Uri "$ClusterUrl/$IndexName/_settings" -Credential $Cred -ContentType "application/json"
# 3. Synced flush
Invoke-RestMethod -Method Post -Uri "$ClusterUrl/$IndexName/_flush/synced" -Credential $Cred
# 4. Make the index writable again
Invoke-RestMethod -Method Put -Body $ReadOnlyOff -Uri "$ClusterUrl/$IndexName/_settings" -Credential $Cred -ContentType "application/json"
# 5. Restore the replicas
Invoke-RestMethod -Method Put -Body $RestoreNumberOfReplicas -Uri "$ClusterUrl/$IndexName/_settings" -Credential $Cred -ContentType "application/json"


For anyone else landing here in search of help: I tried the solutions above, but they didn't work for me. However, after finding https://github.com/elastic/elasticsearch/issues/27536, I added steps to set the translog retention size to 0 (and later back to the default), and that did work. So for me, the sequence was:

# 1. Drop the number of replicas to 0
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

# 2. Set the translog retention size to 0 (per the linked issue)
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "translog.retention.size" : 0
    }
}'

# 3. Mark the index read-only
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "blocks" : {"read_only": true}
    }
}'

# 4. Synced flush
curl -X POST localhost:9200/<index_name>/_flush/synced

# 5. Make the index writable again
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "blocks" : {"read_only": false}
    }
}'

# 6. Reset the translog retention size to its default
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "translog.retention.size" : null
    }
}'

# 7. Restore the replica count
curl -X PUT localhost:9200/<index_name>/_settings -H 'Content-Type: application/json' -d '
{
    "index" : {
        "number_of_replicas" : 1
    }
}'
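
After the last step, it's worth checking that the replicas actually come back and the shards get assigned, e.g. with:

curl "localhost:9200/_cat/shards/<index_name>?v"
curl "localhost:9200/_cluster/health?pretty"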

Try closing the index and opening it again.
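
That would be something along the lines of:

curl -X POST localhost:9200/<index_name>/_close
curl -X POST localhost:9200/<index_name>/_open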
