Hi Aaron, late to this party for sure, sorry. I feel your pain; this is
happening for us too, and I've seen reports of it occurring across versions,
but with so little information to go on I don't think much progress has been
made. I don't think there's even an issue raised for it. Perhaps that
should be a first step.
We call this problem a "Flappy Item", because the item appears and
disappears in search results depending on whether the search hits the
primary or the replica shard. It flaps back and forth.
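If you want to see the flap for yourself, here's a rough Python sketch (not from either of the tools mentioned further down) that compares what the primary returns against each node's copy using the search preference parameter. The cluster address, index name, document ID and node IDs are placeholders you'd fill in yourself.

import json
import requests

ES = "http://localhost:9200"              # assumed cluster address
INDEX, DOC_ID = "myindex", "12345"        # hypothetical index name and suspect ID
NODE_IDS = ["node_id_1", "node_id_2"]     # IDs of the nodes holding copies of the suspect shard

def found(preference):
    """True if a search routed with this preference returns the document."""
    r = requests.post("%s/%s/_search" % (ES, INDEX),
                      params={"preference": preference},
                      data=json.dumps({"query": {"ids": {"values": [DOC_ID]}}}))
    return r.json()["hits"]["total"] > 0

primary_says = found("_primary")
for node in NODE_IDS:
    if found("_only_node:" + node) != primary_says:
        print("Flappy item: the copy on node %s disagrees with the primary" % node)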
The only way to repair the problem is to rebuild the replica shard. You
can disable all replicas and then re-enable them; the primary shard will
be used as the source and the rebuilt replica will be correct. That's if
you can live with the lack of redundancy for that length of time...
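For reference, a minimal sketch of that dance using the index settings and cluster health APIs. The index name, replica count and timeout are placeholders; tune them to your cluster.

import json
import requests

ES, INDEX = "http://localhost:9200", "myindex"   # hypothetical names
ORIGINAL_REPLICAS = 1                            # whatever the index normally runs with

def set_replicas(n):
    requests.put("%s/%s/_settings" % (ES, INDEX),
                 data=json.dumps({"index": {"number_of_replicas": n}}))

set_replicas(0)                    # drop all replicas; only the primaries remain
set_replicas(ORIGINAL_REPLICAS)    # rebuild them fresh, copied from the primaries
# block until the new replicas are allocated and the cluster is green again
requests.get(ES + "/_cluster/health",
             params={"wait_for_status": "green", "timeout": "30m"})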
Alternatively, we have found that issuing a Move command to relocate the
replica shard off its current host and onto another also causes ES to
generate a new replica shard using the primary as the source, which
corrects the problem.
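Something like this, via the cluster reroute API. The shard number and node IDs here are made up; you'd pull the real ones from the cluster state for the affected index.

import json
import requests

ES = "http://localhost:9200"
move = {"commands": [{"move": {"index": "myindex", "shard": 2,
                               "from_node": "node_with_bad_replica",
                               "to_node": "some_other_node"}}]}
requests.post(ES + "/_cluster/reroute", data=json.dumps(move))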
A caveat we've found with this approach, at least with the old version of ES
we're sadly still using (0.19... hmm), is that after the move the cluster
will likely want to rebalance, and the shard allocation after the rebalance
can from time to time put the replica back where it was. ES on that original
node then goes "Oh look, here's the same shard I had earlier, let's use
that"... which means you're back to square one. You can force all replica
shards to move by coming up with a set of move commands that shuffles them
around, and that definitely does work, but it obviously takes longer on
large clusters.
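If you do want to go the shuffle-everything route, here's the general idea as a rough sketch: walk the routing table from the cluster state and issue a move for every started replica copy of the index. It naively assumes every node in the state is a data node, and it doesn't check disk space or allocation rules, so treat it as a starting point rather than something to run blindly.

import json
import requests

ES, INDEX = "http://localhost:9200", "myindex"   # hypothetical names

state = requests.get(ES + "/_cluster/state").json()
all_nodes = list(state["nodes"].keys())          # naive: assumes these are all data nodes
shards = state["routing_table"]["indices"][INDEX]["shards"]

commands = []
for shard_num, copies in shards.items():
    occupied = set(c["node"] for c in copies)
    for copy in copies:
        if copy["primary"] or copy["state"] != "STARTED":
            continue                             # only shuffle healthy replicas
        # pick any node that doesn't already hold a copy of this shard
        targets = [n for n in all_nodes if n not in occupied]
        if not targets:
            continue
        commands.append({"move": {"index": INDEX, "shard": int(shard_num),
                                  "from_node": copy["node"],
                                  "to_node": targets[0]}})
        occupied.add(targets[0])                 # don't send two copies to the same node

requests.post(ES + "/_cluster/reroute", data=json.dumps({"commands": commands}))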
In terms of tooling around this, I offer you these:
Scrutineer - https://github.com/Aconex/scrutineer - can detect differences
between your source of truth (db?) and your index (ES). This does pick up
the case where the replica is reporting an item that should have been
deleted.
Flappy Item Detector - https://github.com/Aconex/es-flappyitem-detector -
given a set of suspect IDs, it can check the primary vs the replica to
confirm or deny that you're hitting one of these cases. There is also
support for issuing basic move commands, with some simple logic to attempt
to rebuild that replica.
Hope that helps.
cheers,
Paul Smith
On 8 August 2014 01:14, aaron atdixon@gmail.com wrote:
I've noticed on a few of my clusters that some shard replicas will be
perpetually inconsistent w/ other shards. Even when all of my writes are
successful and use write_consistency = ALL and replication = SYNC.
A GET by id will return 404/missing for one replica but return the
document for the other two replicas. Even after refresh, the shard is never
"repaired".
Using ES 0.90.7.
Is this a known defect? Is there a means to detect, prevent, or at least
detect & repair when this occurs?