Failing shard recovery

Hi,
We're running Elasticsearch 5.6.3 and have a shard that keeps failing recovery, even when its assignment is rerouted. The error I'm seeing is this:

failed recovery, failure RecoveryFailedException[[index-2019.02.12][3]: Recovery failed from {node-data-004}{dtLXUtdoSXCDhDjQouupWg}{GukEZHvRQ_WOazMlJSdLuA}{192.168.1.71}{192.168.1.71:9300}{zone=az2} into {node-data-006}{KfzBT5rkST-AhbjKWQKYAg}{mMpZUUGrQ5Os9HXrBPLsNw}{192.168.1.79}{192.168.1.79:9300}{zone=az1}]; nested: RemoteTransportException[[node-data-004][192.168.1.71:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [index-2019.02.12][3] from primary shard with sync id but number of docs differ: 52986534 (node-data-004, primary) vs 52986461(node-data-006)];

How do I recover from this type of failure?

Regards,
David

Hi @dawiro,

This error shows that the primary and replica shards don't have the same number of documents. You need to reindex your index in this case.
OR
You can try this; it may be helpful for you.
If you are certain that the primary shards contain all the updates, a quick fix would be to recreate the replica shards for that index. First drop them by setting number_of_replicas to 0:

curl -XPUT "<es-prod-url>:9200/<my-index>/_settings?pretty=1" -H 'Content-Type: application/json' -d'
{
    "index" : {
        "number_of_replicas": 0
    }
}'
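
(Optional sanity check: before re-adding the replicas you can confirm the setting took effect. The URL and index name below are the same placeholders as above.)

curl -XGET "<es-prod-url>:9200/<my-index>/_settings/index.number_of_replicas?pretty=1"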

And then create them again like so:

curl -XPUT "<es-prod-url>:9200/<my-index>/_settings?pretty=1" -H 'Content-Type: application/json' -d'
{
    "index" : {
        "number_of_replicas": 1
    }
}'
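
If it helps, you can then watch the new replicas being rebuilt and wait for the index to go green again (same placeholders as above):

curl -XGET "<es-prod-url>:9200/_cat/recovery/<my-index>?v&active_only=true"
curl -XGET "<es-prod-url>:9200/_cluster/health/<my-index>?wait_for_status=green&timeout=60s&pretty=1"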

Regards,
Harsh Bajaj


This is not true. There is a known issue in versions prior to 6.3.0 that can lead to this, but there is no need to reindex the data. The quick fix is to remove all replicas and then rebuild them as shown. The only way to prevent it happening again is to upgrade to a version ≥ 6.3.0.


Hi @DavidTurner,

Could you please help me understand the part of the message that says the document counts differ?

Sure. There were actually three related issues and they are described on the resiliency status page. All were to do with unexpected interactions between document deletions and out-of-order message delivery. The description links you to the PRs that fix them in 6.3.0 if you want to drill into the full detail.
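
By the way, if you want to see the differing counts for yourself, the cat shards API lists the document count of every copy of every shard (using the same <es-prod-url> placeholder as earlier in the thread, and the index name from your error message):

curl -XGET "<es-prod-url>:9200/_cat/shards/index-2019.02.12?v&h=index,shard,prirep,state,docs,node"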

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.