Hi,
We're running Elasticsearch 5.6.3 and have a shard that is failing both recovery and rerouted allocation. The error I'm seeing is this:
failed recovery, failure RecoveryFailedException[[index-2019.02.12][3]: Recovery failed from {node-data-004}{dtLXUtdoSXCDhDjQouupWg}{GukEZHvRQ_WOazMlJSdLuA}{192.168.1.71}{192.168.1.71:9300}{zone=az2} into {node-data-006}{KfzBT5rkST-AhbjKWQKYAg}{mMpZUUGrQ5Os9HXrBPLsNw}{192.168.1.79}{192.168.1.79:9300}{zone=az1}]; nested: RemoteTransportException[[node-data-004][192.168.1.71:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [index-2019.02.12][3] from primary shard with sync id but number of docs differ: 52986534 (node-data-004, primary) vs 52986461(node-data-006)];
This error shows that the primary and replica shards don't have the same number of documents. In that case you need to reindex the index. Alternatively, you can try the following, which may help:
If you are certain that the primary shards contain all the updates, a quick fix would be to recreate the replica shards for that index. First drop them by setting number_of_replicas to 0:
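As a sketch, assuming the affected index is `index-2019.02.12` (taken from the error above) and that it originally had one replica, the settings calls would look like this; adjust the index name and replica count for your cluster:

```
PUT /index-2019.02.12/_settings
{
  "index": { "number_of_replicas": 0 }
}
```

Once the cluster has dropped the replica shards and the index is green, restore the original replica count:

```
PUT /index-2019.02.12/_settings
{
  "index": { "number_of_replicas": 1 }
}
```

Re-adding the replicas forces a fresh full copy of each shard from the primary, which discards the out-of-sync replica data.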
This is not true. There is a known issue in versions prior to 6.3.0 that can lead to this, but there is no need to reindex the data. The quick fix is to remove all replicas and then rebuild them, as shown above. The only way to prevent it happening again is to upgrade to a version ≥ 6.3.0.
Sure. There were actually three related issues, and they are described on the resiliency status page. All of them stemmed from unexpected interactions between document deletions and out-of-order message delivery. The description links to the PRs that fix them in 6.3.0 if you want to drill into the full detail.