Hi,
Recently we upgraded our production clusters to ES 5.x. We are observing several cases where the number of documents in the primary and replica shards is out of sync.
In particular, the primary shards in this case had fewer documents than the replicas, and when we had to restart the Elasticsearch service, the shards were unable to initialize due to this error.
Q1) Why does a shard with fewer documents get chosen as primary?
Q2) If the syncing of documents is, say, still in progress in the background, why does that stop a replica shard from getting assigned/started? Is there a way to force-sync the primary and replica?
Q3) Any recommendations on how best to handle this?
-> The logs indicate they have different counts; I'm pasting the logs below.
-> Our indexing workload is such that >95% is plain indexing and less than 5% is deletion, so at that scale we can't have a ~500K-document difference just due to deletions.
-> By the error on restart I meant: when restarting the ES service, the shards fail to get assigned, which puts the cluster in a yellow or even red state (if it was a primary shard that failed to get assigned).
2017-09-19T10:21:09.8317348Z searchindex1 0 r UNASSIGNED ALLOCATION_FAILED failed recovery, failure RecoveryFailedException[[searchindex1][0]: Recovery failed from {es-d06-rm}{6xSE_CzOSXiig0Ehr-3r8A}{RM8jwlD0TG2yPmA_xb2WpQ}{20.0.0.156}{20.0.0.156:9300}{faultDomain=1, updateDomain=0, ml.enabled=true} into {es-d02-rm}{ix5uwlQDTm6pDdawWIqg_A}{2tsQzSO7T12YsqH2VRkhog}{20.0.0.152}{20.0.0.152:9300}{faultDomain=1, ml.enabled=true, updateDomain=1}]; nested: RemoteTransportException[[es-d06-rm][20.0.0.156:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: IllegalStateException[try to recover [searchindex1][0] from primary shard with sync id but number of docs differ: 3174006 (es-d06-rm, primary) vs 3474424(es-d02-rm)];
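For reference, a quick way to compare the per-copy document counts is the cat shards API; something like this (the host is just a placeholder for one of our nodes):
# List every copy of each shard of the index with its document count and node
curl -XGET 'http://localhost:9200/_cat/shards/searchindex1?v&h=index,shard,prirep,state,docs,node'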
I saw the same error message a month ago on a 5.4.1 cluster when we were upgrading the OS on all our servers. By accident I took down one server before the previously restarted server had had time to update all of its shards.
When that second server came back up, the old primary remained unassigned as a stale primary. The error message I got said the same thing you mention and simply meant that the Elastic cluster could not choose between the former primary and the not-up-to-date replica as to which one to promote to primary.
The best solution, if you still have the original data, would be to drop the index containing the stale primary and then create a new index that you populate with the original data. If that is not an option (it certainly wasn't for me), you may try to force Elasticsearch to pick one of the shards as the primary by using the Cluster Reroute API.
I managed to save the original primary (but possibly lost some data) by forcing the stale primary (which was shard #3 in the index) to be activated, something like this:
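# Force the stale copy of shard 3 back in as the primary
# (index and node names are placeholders; accept_data_loss must be true)
curl -XPOST 'http://localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "my_index",
        "shard": 3,
        "node": "node-with-stale-primary",
        "accept_data_loss": true
      }
    }
  ]
}'
The index and node names above are of course placeholders for my own index and the node holding the stale copy, and allocate_stale_primary requires accept_data_loss to be set to true, which is a reminder that whatever the other copy had and this one doesn't will be lost.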
Thanks Bernt!
I did exactly what you suggested for the mitigation.
Mitigating PROD issues with loss of data every time is something I would like to avoid. Isn't it correct to expect ES to take care of the syncing/promoting between primary and replica shards?
I would like to know if this is a known scenario where automatic mitigation isn't possible at all, or whether there could be a potential fix.
Yes, Elasticsearch will normally handle the syncing of shards and the promotion of new primaries really well, but only as long as you keep the cluster in a healthy green state before taking down (by accident or on purpose) a node in the cluster.
The problem (in my case) was that the primary resided on a node in one data room and the replica on a node in the other (as per the cluster design), and that I took down the node where the primary resided before it had time to update the replica completely. So in this case it really was my own fault for not waiting for the cluster to become green before taking down that node.
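A simple way to wait for that is the cluster health API with wait_for_status; something like this before each restart (host and timeout are just examples):
# Block until the cluster reports green, or give up after 60 seconds
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'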
Naturally, if your cluster loses many nodes by accident the same situation may occur. But if the cluster goes red you won't be able to index into it, and in that case the shards will probably be fine when the nodes come back up. So I have only experienced this once, while having experienced red cluster health multiple times.
Thanks Bernt!
We have encountered something similar to what you explained: the primary going down while the replicas were initializing.
Do you have any explanation of why there were fewer docs in the primary compared to the replica, as seen in the exception?
I am asking because when the primary went down the cluster became RED, and there is no way any indexing request [addition/deletion of docs] could be accepted. In that case, where could the replica have got the extra docs from? The only possibility I can think of is that when the primary went down there was some loss of data that it was then trying to recover from. Is that possible?
Also, it would be good to know if there is any force-sync API available to sync the primary and replica.
No, a primary shard won't lose data if it goes down; the extra docs in the replica must have been added after the primary went down and that replica was promoted to new primary. A shard is a Lucene index with immutable segments, and I doubt very much that it can lose one or more segments; if it did, I would expect the entire shard to be marked as bust and not come back up.
One theory is that when the original primary went down it still hadn't had time to update the replica with all its documents, so when the replica became the new primary it had fewer documents than the old primary (not sure if that can happen, but it's a theory). So when the old primary came back online, now as a replica, it still had more docs because the new primary hadn't been indexed into yet (the index being RED, remember).
If that is what happened, you would expect the old primary to have the best data, so that's the shard you should keep in the index, not the new primary.
As far as I know there is no force-sync API; your best bet would be to use the Cluster Reroute API that I mentioned earlier, as it allows you to select one of the shards and promote it to primary status.
By the way, when I first started looking at Elasticsearch I heard repeatedly that it should not be used for primary data storage, and that you should always keep the original data in a safe backup or repository somewhere. Even five years later that seems like good advice: things can still go wrong, like in this example with the stale primary shard, and then data may get lost.