Possible indexing race condition for simultaneous add/delete


(ppearcy) #1

Hey,
I think I have stumbled upon a minor bug in 0.13.0. We have lots of
data sources flowing in, most of which have a healthy pattern of
adds/updates/deletes for each doc, i.e. in most cases there is a good
amount of delay between these operations.

We have a couple of data sources with a somewhat sketchy setup,
resulting in nearly simultaneous adds, updates, and deletes for a
document. We're going to make changes on our side to address this
type of needless thrashing.

However, some automated testing I have set up caught 7 instances of a
document being available on one shard and not another. For example, in
a two-node cluster I can search for that document, and on every
refresh I do it will appear, then disappear, as the shards are
round-robined. I believe the only way an issue like this could surface
is a bug within ES. My random guess is some ordering or race condition
when a refresh occurs or when items are written to the translog.
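
A rough sketch of the kind of check that catches this: run the same
search several times and see whether the answers disagree. The name
is_flapping and the search_hits callable are made up for illustration;
search_hits stands in for whatever query tells you whether the
document came back, and the cycle below only simulates two copies
answering in turn, it does not talk to ES.

```python
import itertools
from typing import Callable


def is_flapping(search_hits: Callable[[], bool], attempts: int = 10) -> bool:
    """Return True if repeated identical searches disagree about the document."""
    answers = {bool(search_hits()) for _ in range(attempts)}
    return len(answers) > 1


# Simulated round robin: one copy has the document, the other does not.
copies = itertools.cycle([True, False])
print(is_flapping(lambda: next(copies)))
```

With a healthy index the set of answers collapses to a single value,
so the function returns False.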

If the bug were an ordering problem on my side, I would end up with
documents out of sync with my data store, rather than shard replicas
being out of sync with each other.

I think this is one of those problems that will be a pain to
reproduce. You'd need a cluster with at least two nodes and a test
app running against it with two threads, where one receives an add
and the other receives the delete. Which one sticks would be
non-deterministic, but either way, there shouldn't be any drift
between the shard replicas.
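
Something along these lines is what I have in mind for the test app.
This is only a sketch: race, copies_agree, fake_add and fake_delete
are hypothetical names, the in-memory set is a stand-in for the
cluster, and how you actually send the add/delete and query each shard
copy individually is left open.

```python
import threading
from typing import Callable, Sequence


def race(add: Callable[[], None], delete: Callable[[], None]) -> None:
    """Run add and delete on two threads released at the same moment."""
    barrier = threading.Barrier(2)

    def run(op: Callable[[], None]) -> None:
        barrier.wait()  # each thread blocks here until the other arrives
        op()

    threads = [threading.Thread(target=run, args=(op,)) for op in (add, delete)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def copies_agree(present_on_copy: Sequence[Callable[[], bool]]) -> bool:
    """True if every shard copy gives the same answer about the document."""
    return len({bool(check()) for check in present_on_copy}) == 1


# Stand-in for a cluster: one in-memory set plays both copies.
store = set()
lock = threading.Lock()


def fake_add() -> None:
    with lock:
        store.add("doc-1")


def fake_delete() -> None:
    with lock:
        store.discard("doc-1")


race(fake_add, fake_delete)
print(copies_agree([lambda: "doc-1" in store, lambda: "doc-1" in store]))
```

Whether the add or the delete wins is up to the scheduler; the only
thing the check cares about is that all copies end up with the same
answer.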

I don't consider this a major issue and it isn't causing me pain, but
I wanted to point out what I have observed.

Let me know if any more details would be of use.

Thanks,
Paul


(Shay Banon) #2

Let me try and reproduce this on my end, and I will ping back...

On Fri, Dec 17, 2010 at 2:06 AM, Paul ppearcy@gmail.com wrote: