Hi,
We're running Elasticsearch 5.6.3 and have a shard that is failing both recovery and rerouted allocation. The error I'm seeing is this:
failed recovery, failure RecoveryFailedException[[index-2019.02.12][3]: Recovery failed from {node-data-004}{dtLXUtdoSXCDhDjQouupWg}{GukEZHvRQ_WOazMlJSdLuA}{192.168.1.71}{192.168.1.71:9300}{zone=az2} into {node-data-006}{KfzBT5rkST-AhbjKWQKYAg}{mMpZUUGrQ5Os9HXrBPLsNw}{192.168.1.79}{192.168.1.79:9300}{zone=az1}]; nested: RemoteTransportException[[node-data-004][192.168.1.71:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [index-2019.02.12][3] from primary shard with sync id but number of docs differ: 52986534 (node-data-004, primary) vs 52986461(node-data-006)];
This error shows that the primary and replica shards don't have the same number of documents. In that case you need to reindex the index. Alternatively, you can try the following, which may help:
If you are certain that the primary shards contain all the updates, a quick fix would be to recreate the replica shards for that index. First drop them by setting number_of_replicas to 0:
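As a sketch, assuming the affected index is `index-2019.02.12` (taken from the error above) and that it originally had one replica, the settings calls would look like this; adjust the index name and replica count for your cluster:

```
PUT /index-2019.02.12/_settings
{
  "index": { "number_of_replicas": 0 }
}
```

Once the cluster has dropped the replica shards and the index is green, restore the original replica count:

```
PUT /index-2019.02.12/_settings
{
  "index": { "number_of_replicas": 1 }
}
```

Re-adding the replicas forces a fresh full copy of each shard from the primary, which discards the out-of-sync replica data.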
This is not true. There is a known issue in versions prior to 6.3.0 that can lead to this, but there is no need to reindex the data. The quick fix is to remove all replicas and then rebuild them, as shown above. The only way to prevent it happening again is to upgrade to a version ≥ 6.3.0.
Sure. There were actually three related issues, and they are described on the resiliency status page. All of them stemmed from unexpected interactions between document deletions and out-of-order message delivery. The description links to the PRs that fix them in 6.3.0 if you want to drill into the full detail.