Concurrent delete_by_query and indexing

Hi there,

I am using the delete_by_query plugin, and its behavior is not quite predictable.

Here is what I do, step by step:

A doc has been in the index for a long while: {id: 1, _sync_timestamp: 1}.
Then I run the following actions concurrently:

  1. I index the updated doc
    index {document with id: 1 and _sync_timestamp: 2}
  2. I run delete by query
    {must id: 1, must not _sync_timestamp gte: 2}
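
For reference, the pseudo-query in step 2 could be written as an Elasticsearch bool query roughly like the sketch below. The field names id and _sync_timestamp come from the post. The helper name build_stale_docs_query and the choice of term and range clauses are assumptions, since the exact request is not shown.

```python
import json


def build_stale_docs_query(my_id, sync_timestamp):
    """Build a query body matching docs for my_id older than sync_timestamp.

    Hypothetical helper: the clause types (term, range) are an assumption
    about how the pseudo-query above maps onto the query DSL.
    """
    return {
        "query": {
            "bool": {
                # Only docs that belong to this application object.
                "must": [{"term": {"id": my_id}}],
                # Skip docs written by the current (or a newer) sync run.
                "must_not": [
                    {"range": {"_sync_timestamp": {"gte": sync_timestamp}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_stale_docs_query(1, 2), indent=2))
```

The resulting dict would be sent as the body of the delete-by-query request. How it is sent depends on the client and Elasticsearch version in use.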

The idea is simple: since delete by query is asynchronous, it may delete docs that were just updated. That's why I use _sync_timestamp (_sync_ts for short) together with a must_not gte clause on it.

The problem is that sometimes (in fact, very frequently) delete_by_query reports a number of failed deletions > 0. I don't know the reason, since I am not familiar with the logic under the hood.

As far as I understand, there are a few possible scenarios; please correct me if I am wrong:

  1. The updated doc has just been indexed and merged: delete_by_query simply skips it because of _sync_ts.
  2. The doc is indexed but not merged (the old doc is marked as deleted, the new doc is waiting to appear in the index): delete_by_query fails on this doc.
  3. The doc is not acknowledged at all: delete_by_query marks the doc matching on _sync_ts as deleted, and then the new doc is simply indexed separately, with no merge.

Am I right that delete_by_query fails in scenario (2)? Is there a better way to do the same thing without having to deal with docs that failed to delete?

Thanks,
Sergii

Hi,

First, a question: I'm confused about what you mean by "id". Is this the Elasticsearch "_id" document ID, or some field you are setting explicitly? In the first case, you can simply update the document without having to delete the old one; that should be done automatically.

Regarding the second part of your question: _delete_by_query takes a snapshot of the index when it starts and deletes what it finds using internal versioning. It is therefore not concerned with marking or merging documents at the Lucene shard level. That means, if the document changes between the time when the snapshot was taken and when the delete request is processed, you get a version conflict.
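
To illustrate the snapshot-then-delete behavior described above, here is a toy in-memory model. It is not how Elasticsearch is implemented. ToyIndex and all of its methods are hypothetical names, and it assumes the concurrent update reuses the same "_id" so that the document's internal version is bumped.

```python
class ToyIndex:
    """Toy stand-in for an index: maps _id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        # Every write to an existing _id bumps its internal version.
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, source)
        return version

    def snapshot(self, predicate):
        # Record (_id, version) pairs that match at snapshot time.
        return [
            (doc_id, version)
            for doc_id, (version, source) in self.docs.items()
            if predicate(source)
        ]

    def delete_if_version(self, doc_id, expected_version):
        # Delete only if the doc is unchanged since the snapshot.
        current = self.docs.get(doc_id)
        if current is None or current[0] != expected_version:
            return False  # version conflict
        del self.docs[doc_id]
        return True


def run_demo():
    index = ToyIndex()
    index.index("a", {"id": 1, "_sync_timestamp": 1})

    # Delete by query starts: the old doc matches and is captured.
    hits = index.snapshot(
        lambda s: s["id"] == 1 and not (s["_sync_timestamp"] >= 2)
    )

    # A concurrent indexing request updates the doc before the delete runs.
    index.index("a", {"id": 1, "_sync_timestamp": 2})

    # The deletes are applied against the snapshot versions.
    failed = [d for d, v in hits if not index.delete_if_version(d, v)]
    return failed, index.docs


if __name__ == "__main__":
    print(run_demo())
```

In this model the delete for "a" is rejected because its version moved from 1 to 2 after the snapshot, so the updated doc survives and the delete is counted as failed.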

Sorry for the confusing definitions. id does not mean Elasticsearch's _id; it's possible to have a few docs in ES with the same id.

In my application I want ES to be in sync with the DB, and I have a sync function for this. sync runs in a queue, and after a sync(my_id) call a doc may change state from "needs to be indexed" to "needs to be deleted".

my_id represents one object in the application, but it may be represented by several objects in ES. The cheapest way to sync such a domain is to index all docs we can find in the DB by my_id and delete all old docs matching the my_id criteria.

"That means, if the document changes between the time when the snapshot was taken and when the delete request is processed, you get a version conflict."

What is the best practice for dealing with such a conflict? Perform the deletion and, if some docs failed, wait and run it once more?
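
The "delete, then wait and delete again" idea from the question could be sketched as below. This is only one possible reading of that approach, not a confirmed best practice. run_delete is a hypothetical callable that performs one delete-by-query and returns the number of docs that failed to delete; the function name and parameters are assumptions as well.

```python
import time


def delete_until_clean(run_delete, max_attempts=3, wait_seconds=1.0,
                       sleep=time.sleep):
    """Re-run a delete until it reports no failures or attempts run out.

    Returns (attempts_used, failures_in_last_attempt).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    failed = 0
    for attempt in range(1, max_attempts + 1):
        failed = run_delete()
        if failed == 0:
            return attempt, 0
        if attempt < max_attempts:
            # Give concurrent indexing a moment to settle before retrying.
            sleep(wait_seconds)
    return max_attempts, failed


if __name__ == "__main__":
    # Simulated results: two conflicts, then one, then a clean run.
    outcomes = iter([2, 1, 0])
    print(delete_until_clean(lambda: next(outcomes), wait_seconds=0.0))
```

Because each retry issues a fresh query, a doc that was re-indexed with a newer _sync_timestamp no longer matches the must_not clause and is simply skipped on the next pass.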
