Deleted items appear when searching a replica shard

We're using ES 0.19.2 at the moment, in a 3-node configuration.

I have an odd case I can't explain. We have an item in an ES index that
sometimes comes back, depending on which host/shard/replica receives the
query. We tracked the issue down to a particular ID for a given type, and
when one does a simple GET request against this ID, like so:

curl -XGET http://localhost:9200/au1/task/168802590

we get alternating-ish views, some showing it doesn't exist, and some
showing it does:

{
  "_index" : "au1",
  "_type" : "task",
  "_id" : "168802590",
  "exists" : false
}

Then do it again, and we get something like this (redacted):

{
  "_index" : "au1",
  "_type" : "task",
  "_id" : "168802590",
  "_version" : 1359072320783,
  "exists" : true,
  "_source" : {
    "id" : 168802590,
    "type" : "awaitingreview",
    "description" : "BLAH BLAH",
    "creationDate" : 1359068463727,
    "projectId" : 30979,
    "projectName" : "BLAH BLAH",
    "stepName" : "BLAH BLAH",
    "assignedBy" : "BLAH BLAH",
    "mailId" : null,
    "dueDate" : 1359154862937,
    "referenceLabel" : "BLAH BLAH",
    "aggregateCount" : 1,
    "featureId" : 2,
    "startDate" : null,
    "aggregateId" : 134736434,
    "assignedToUser" : "BLAH BLAH",
    "aggregated" : true,
    "pertinent" : true,
    "assignedByUser" : "BLAH BLAH",
    "assignedByUserId" : 415818,
    "assignedToUserId" : 415818,
    "dueDateIso" : 20130126
  }
}

If I then modify this GET to use ?preference=_primary to hit only the
primary shard, I get 100% consistent results of exists=false (as in, the
item is deleted on the primary shard).

If I then change to preference=_local and walk over each node, guessing
which node might hold the replica for this item (I don't know what the
routing value would be for this, so I don't know which shard it would go to),
I consistently get the full result above. I can't seem to prove which
shard replica is affected here.
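
For reference, the two variants I'm comparing look roughly like this
(nodeN is just a placeholder for each node's hostname):

# hits only the primary copy; always comes back exists=false
curl -XGET 'http://localhost:9200/au1/task/168802590?preference=_primary'

# run against each node in turn; on whichever node holds the stale replica
# this keeps returning the full, supposedly deleted document
curl -XGET 'http://nodeN:9200/au1/task/168802590?preference=_local'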

This leads me to think that the replica for a given shard does not have a
correct view of the deleted items for each segment.

I think I have 2 options:

  • use _optimize to properly expunge the deletes (maybe with just the
    expunge-deletes-only option, though I still have to use max_num_segments=1
    to force the optimize to happen).

  • Set replicas to 0, and then rebuild the replica set by setting it
    back to 1. This is not my preferred option, given the small-ish window of
    vulnerability of having no redundancy.

Before I do any of the above, are there any other ideas out there on what I
can do, or things I could do to collect more info to help work out why? Are
there other ways to force removal of deleted items?

cheers,

Paul Smith


Hi Paul

> This leads me to think that the replica for a given shard does not
> have a correct view of the deleted items for each segment.

Try using the index-status API to compare doc counts between shards:
curl -XGET 'http://127.0.0.1:9200/_status?pretty=1'
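
If one copy of a shard really has drifted, the per-shard docs section of
that output should show different num_docs/deleted_docs for the primary and
the replica of the same shard. Something like this (field names from memory,
so check them against your actual output) makes the comparison easier:

curl -s 'http://127.0.0.1:9200/au1/_status?pretty=1' | grep -E '"primary"|num_docs|deleted_docs'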

Also, worth trying:

  1. Flush the indices, wait for a bit and check again:

    curl -XPOST 'http://localhost:9200/_flush?refresh=true&full=true&force=true'

  2. Clear the caches, wait for 60 seconds and check again:

    curl -XPOST 'http://localhost:9200/_cache/clear'

  3. Try re-deleting the doc, or possibly re-creating then deleting it
     (the delete call for your doc is sketched just below this list)
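
For the doc from your mail, the re-delete is just:

    curl -XDELETE 'http://localhost:9200/au1/task/168802590'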

If none of these options work, then on to the two options below:

> I think I have 2 options:
>
>   • use _optimize to properly expunge the deletes (maybe with just the
>     expunge-deletes-only option, though I still have to use
>     max_num_segments=1 to force the optimize to happen).

And optimize can be IO-heavy.

>   • Set replicas to 0, and then rebuild the replica set by setting it
>     back to 1. This is not my preferred option, given the small-ish window
>     of vulnerability of having no redundancy.

This would be my preferred choice. Disable allocation before and
re-enable it after. Recovery should be quick for all shards, but possibly
slightly slower for the incorrect shard.
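
Roughly, the dance would be something like the below (the allocation-disable
setting comes from later docs and may not exist on something as old as
0.19.2, so treat that part as an assumption; the number_of_replicas calls
are the normal index-settings API):

    # optionally stop shards being shuffled around while replicas are off
    # (setting name from later versions; may not be available on 0.19.2)
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '
    {"transient" : {"cluster.routing.allocation.disable_allocation" : true}}'

    # drop to zero replicas, then rebuild them from the primaries
    curl -XPUT 'http://localhost:9200/au1/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
    curl -XPUT 'http://localhost:9200/au1/_settings' -d '{"index" : {"number_of_replicas" : 1}}'

    # remember to flip disable_allocation back to false afterwards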

As to why it has happened? I don't know. Have you had OOMs or some
other similar disruption on the network?

clint


I tried the flush, the cache clear, and re-deleting and re-putting the doc
back, to no avail.

Rather than go down the replica=0 approach first, I tried optimizing down
to max_num_segments=1 during idle time, but the result still comes up if it
goes through the replica shard. Using preference=_primary always comes out
as deleted OK, so I still suspect the shard replica, though by optimizing
I'm not sure why the replica never received a 'cleansed' copy of the bad
segment.
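
For reference, the optimize call I used was along these lines
(only_expunge_deletes is how I read the docs for the 'expunge deletes'
option, so double-check the exact parameter names against 0.19):

curl -XPOST 'http://localhost:9200/au1/_optimize?only_expunge_deletes=true&max_num_segments=1'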

I'm down to setting replicas to 0 now, which is a bit nail-biting, Murphy
being the sod that he is, this is when stuff happens. Alternatively I
could do a rolling restart of each node, slowly, to get all the shards to
rebalance, but since I don't know which shard contains this record, I may
end up getting the replica copy instead of the primary, but it's possible
that it can then be properly deleted after that. Is there a way to work
out by hand which shard an item will go to by its ID? What's the default
hashing algorithm?

Dunno if Shay or anyone else has any other ideas? (Reminder: we're on 0.19.2
at the moment, and yes, don't I wish I could upgrade.)

regards,

Paul


Hi Paul

> Rather than go down the replica=0 approach first, I tried optimizing
> down to max_num_segments=1 during idle time, but the result still
> comes up if it goes through the replica shard. Using
> preference=_primary always comes out as deleted OK, so I still
> suspect the shard replica, though by optimizing I'm not sure why the
> replica never received a 'cleansed' copy of the bad segment.

optimize happens local to each shard, so it wouldn't trigger a
"get-latest-from-primary" action.

> I'm down to setting replicas to 0 now, which is a bit nail-biting,
> Murphy being the sod that he is, this is when stuff happens.
> Alternatively I could do a rolling restart of each node, slowly, to
> get all the shards to rebalance, but since I don't know which shard
> contains this record, I may end up getting the replica copy instead of
> the primary, but it's possible that it can then be properly deleted
> after that. Is there a way to work out by hand which shard an item
> will go to by its ID? What's the default hashing algorithm?

Do an 'ids' search with explain turned on. That'll return the node and
shard:

curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
  "fields" : [],
  "query" : {
    "ids" : {
      "values" : [
        "wVaNiUKTRE-Ax5LLR61OkA"
      ]
    }
  },
  "explain" : 1
}
'

{
  "hits" : {
    "hits" : [
      {
        "_score" : 1,
        "_index" : "test",
        "_shard" : 2,
        "_id" : "wVaNiUKTRE-Ax5LLR61OkA",
        "_node" : "x8E-zFIsTGSXFo3xq4hb4A",
        "_type" : "test",
        "_explanation" : {
          "value" : 1,
          "details" : [
            {
              "value" : 1,
              "description" : "boost"
            },
            {
              "value" : 1,
              "description" : "queryNorm"
            }
          ],
          "description" : "ConstantScore(_uid:test#wVaNiUKTRE-Ax5LLR61OkA), product of:"
        }
      }
    ],
    "max_score" : 1,
    "total" : 1
  },
  "timed_out" : false,
  "_shards" : {
    "failed" : 0,
    "successful" : 5,
    "total" : 5
  },
  "took" : 2
}

clint


Thanks for that tip, Clinton! That's identified the host with the replica
shard. If I run that type of query I get flapping results (sometimes no
hit, sometimes a result from the host holding the replica shard), and I
can see it's always the host with the replica of that shard that returns
the result.

My next step will be to shut down ES on that node, let the cluster
re-replicate the shard to one of the other nodes, and see if the problem
goes away, before bringing that node back up again.
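
For the record, I'm planning to do the shutdown via the API rather than the
init scripts, something like this run against the node holding the suspect
replica (path from memory, so worth double-checking against the 0.19 docs):

curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'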

I'll report back on what I find.

thanks again,

Paul


Thanks to Clinton's tip for finding the shard info for this ID, I worked out
which node held the replica shard, gracefully shut it down, let the cluster
replicate and rebalance (because I don't have the option to prevent
allocation in 0.19.2), and then brought the node up again (and re-balanced).
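
For anyone hitting the same thing: a health check like this is a handy way
to be sure recovery has finished before bringing the node back (parameters
from memory, adjust to taste):

curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty=1'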

Now the deleted item no longer comes back in results, so I think this dodgy
replica shard is now history, though what happened to it is, of course,
unclear.

But my problem now appears fixed (touch wood).

Thanks Clinton!

Paul
