Versioning and out-of-order deletes

Background

I am using ES as a search backend for a DB-based 'system of record'. The DB
maintains revision numbers for documents (and tracks deletes via a tombstone
concept). Indexing is done asynchronously via a messaging system. Indexing
messages (updates and deletes) may therefore be duplicated or reordered. I
am trying to use the (cool) ES versioning feature to prevent stale data in
the index.

Problem

Preventing stale data is easy enough for updates (I just need to drop
messages if ES reports a version conflict). My problem currently is
out-of-order delete requests, as in this example (revision numbers in
brackets):

update[1], update[2], update[3], delete[4] getting reordered into

update[1], delete[4], update[2], update[3]

I cannot really use the versioning support for delete requests, as this would
cause the reordered delete to be rejected (I would need to buffer it somehow
to be replayed later, which I want to avoid). If I do the delete regardless
of the version, it seems to auto-increment the version on the ES side. The
update with revision 2 then fails (good), but the update with revision 3
succeeds (bad).
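
To make the failure mode concrete, here is a toy sketch in Python of the
version bookkeeping as I understand it (no ES client involved; I am modeling
the external-version check as "reject if currentVersion >= providedVersion"):

```python
# Toy model of ES versioning as described above (no client involved).
index = {}  # doc_id -> current version

def external_update(doc_id, version):
    if index.get(doc_id, 0) >= version:
        return "conflict"   # stale message, safe to drop
    index[doc_id] = version
    return "ok"

def unversioned_delete(doc_id):
    # Observed behavior: the delete just bumps the version by one.
    index[doc_id] = index.get(doc_id, 0) + 1
    return "deleted"

# Reordered stream: update[1], delete[4], update[2], update[3]
print(external_update("doc", 1))   # ok       -> version 1
print(unversioned_delete("doc"))   # deleted  -> version 2 (not 4!)
print(external_update("doc", 2))   # conflict -> good
print(external_update("doc", 3))   # ok       -> bad: tombstone overwritten
```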

My Solution

My (possibly stupid) solution would be to build a kind of tombstone concept
for my index docs. I would turn the delete[4] into an update[4] and
additionally mark the document as deleted (e.g. by adding a deletion
timestamp to the JSON). With the delete now being a versioned update, it
would be accepted by ES and would cause the delayed updates [2] and [3] to be
skipped due to version conflict. If the doc got recreated via update[5], this
would work as well.
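
On the indexer side the transformation would be something like this (a
sketch with my own field and message names; `deleted_at` is the marker I
would add, and the DB revision becomes the external version):

```python
import time

def to_index_request(message):
    """Map an inbound messaging-system event to a versioned ES write."""
    if message["type"] == "delete":
        # Index a tombstone instead of deleting: same id, the delete's
        # revision as external version, and only a deletion marker as body.
        body = {"deleted_at": int(time.time() * 1000)}
    else:
        body = message["document"]
    return {
        "id": message["id"],
        "version": message["revision"],  # version_type=external
        "body": body,
    }
```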

I would of course have all queries filtered implicitly by the deletion
marker (e.g. a field-existence filter on the deletion timestamp field) to
make sure I never see deleted docs in search results. I would also need a
kind of garbage collection that really deletes the tombstone documents after
some grace period, but this should be trivial based on the deletion
timestamp.
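
Query-side, the implicit filter and the GC criterion would look roughly like
this (sketch only; the filter syntax is quoted from memory and may need
adjusting):

```python
def with_tombstone_filter(user_query):
    # Wrap every user query so tombstoned docs never show up in results.
    return {
        "query": {
            "filtered": {
                "query": user_query,
                "filter": {"not": {"exists": {"field": "deleted_at"}}},
            }
        }
    }

def gc_filter(now_ms, grace_period_ms):
    # Tombstones older than the grace period are safe to really delete
    # (e.g. via delete-by-query in a periodic job).
    return {"range": {"deleted_at": {"lt": now_ms - grace_period_ms}}}
```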

My Concern

I know this scheme replicates in part what is done internally in Lucene
anyway. I therefore have the feeling of reinventing the wheel and being on
the wrong track. Does anybody have a better idea on how to handle
out-of-order deletes?

The reason the version gets rejected is that it does not exactly match the
version you are trying to update (that's how optimistic locking works).

But in your case, when you apply the changes to the other cluster, set the
version_type to external. In this case, the check for rejection is:
currentVersion >= deleteVersionProvided.
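
Spelled out against your reordered stream, that check does what you want
(just the comparison, as a sketch):

```python
def rejected(current_version, provided_version):
    # version_type=external: reject iff currentVersion >= providedVersion
    return current_version >= provided_version

assert not rejected(0, 1)  # update[1]  accepted
assert not rejected(1, 4)  # delete[4]  accepted, tombstone keeps version 4
assert rejected(4, 2)      # update[2]  rejected as stale
assert rejected(4, 3)      # update[3]  rejected as stale
```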

Also, if you do cross-cluster replication, make sure to increase the delete
garbage collection time. It defaults to 60 seconds. The setting
is index.gc_deletes and takes a time value, for example, 1h.
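
For example, something like this should work to bump it (a sketch using the
Python requests library against the index settings endpoint; host and index
name are placeholders):

```python
import requests

# Raise the deleted-version GC grace period so that late out-of-order
# updates still find the tombstone's version and get rejected.
response = requests.put(
    "http://localhost:9200/myindex/_settings",
    json={"index.gc_deletes": "1h"},
)
print(response.status_code, response.text)
```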


Thanks for the quick response Shay. I am using version_type = external for
the updates. I do not yet understand how to handle the out-of-order delete
based on your comments. If I assign it the external version number [4], it
will fail (as the index document is at version [1] when I do the delete). I
know this is how optimistic control is supposed to work, but it leaves me
with the problem of having to buffer the delete until it's time to apply it
again. I was looking for a way to avoid buffering deletes, and hence ended
up with the concept I outlined. Am I missing something?

Mmm, that's a bug in the REST API not properly extracting the version_type
parameter. Opened an issue: "Rest Delete API does not honor the
`version_type` parameter" (elastic/elasticsearch#1337). It is fixed in the
0.17 branch and master; 0.17.7 is scheduled for tomorrow.


Are those deleted documents stored in the index files until garbage
collection (so resilient to complete cluster/node failures)?

Hi Shay,

I did some more testing and it works pretty well if I use version_type =
external for the updates *and* deletes consistently (via the Java API). I
was also able to see the effect of the deleted-document cache expiring (as
you pointed out). I would be very interested in whether this cache is
persisted somehow (as Ian below is asking as well).

However, there still seems to be a problem when the *delete* arrives before
the first update (i.e. deleting a document that is not there). The
out-of-order update always succeeds in this case. So it seems that a
delete for a non-existing document is not recorded in the deleted-documents
cache. Is this correct? Is there any way around this?
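
To illustrate, extending the toy model from my first mail (assuming the
delete of a missing doc leaves nothing behind, which is what I seem to be
seeing):

```python
index = {}       # doc_id -> current version
tombstones = {}  # doc_id -> version of an accepted delete (the "cache")

def external_update(doc_id, version):
    current = max(index.get(doc_id, 0), tombstones.get(doc_id, 0))
    if current >= version:
        return "conflict"
    tombstones.pop(doc_id, None)
    index[doc_id] = version
    return "ok"

def external_delete(doc_id, version):
    if doc_id not in index:
        # Delete of a non-existent doc leaves no trace: the behavior I see.
        return "not_found"
    if index[doc_id] >= version:
        return "conflict"
    del index[doc_id]
    tombstones[doc_id] = version   # remembered until gc_deletes expires
    return "deleted"

# Works: update[1], delete[4], then update[2] conflicts, as expected.
external_update("a", 1); external_delete("a", 4)
print(external_update("a", 2))    # conflict (tombstone at version 4)

# Breaks: delete[4] before any update leaves nothing behind.
print(external_delete("b", 4))    # not_found, version 4 is lost
print(external_update("b", 3))    # ok -> bad: the stale update succeeds
```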

The record of deleted-document tombstones is not persisted, though it is
"replicated".

Regarding your other question, I think that in this case, where the delete
comes before an update, we can create a tombstone for it (especially for
external version values). Can you open an issue?


Sorry, in terms of persistency: since deletes are stored in the transaction
log, whatever is in the transaction log will get replayed, and those deletes
will remain (with the proper version) even after a full cluster restart.


Opened issue: https://github.com/elasticsearch/elasticsearch/issues/1351