Concurrent delete_by_query and indexing

Serg_Pilipenko · December 13, 2016, 12:39pm

Hi there,

I am using the delete_by_query plugin, and its behavior is not quite predictable.

Here is what I do, step by step:

There has been a doc in the index for a long while: {id: 1, _sync_timestamp: 1}
Then I concurrently run the following actions:

I index the updated doc:
index {document with id: 1 and _sync_timestamp: 2}
I run delete by query:
{must id: 1, must_not _sync_timestamp gte: 2}

The idea behind this is simple: since delete by query is asynchronous, it may delete docs that were just updated. That's why I use _sync_timestamp (_sync_ts for short) together with must_not gte on it.
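For illustration, the query described above could be written out roughly as follows. This is only a sketch: the field names `id` and `_sync_timestamp` come from the description above, while the helper name `build_stale_delete_query` and the exact use of `term` and `range` clauses are assumptions about how the request body is built.

```python
def build_stale_delete_query(my_id, sync_ts):
    """Build a delete-by-query body (hypothetical helper).

    Matches docs whose `id` field equals my_id, excluding any doc whose
    `_sync_timestamp` is greater than or equal to sync_ts.
    """
    return {
        "query": {
            "bool": {
                # Only docs that belong to this application object.
                "must": [{"term": {"id": my_id}}],
                # Skip docs written by the current sync run or a later one.
                "must_not": [
                    {"range": {"_sync_timestamp": {"gte": sync_ts}}}
                ],
            }
        }
    }


body = build_stale_delete_query(my_id=1, sync_ts=2)
```

With this body, a doc indexed with _sync_timestamp: 2 should never match, so only older copies are candidates for deletion.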

The problem is that sometimes (very frequently) delete_by_query reports a number of failed deletions > 0. I don't know the reason, since I am not familiar with the logic under the hood.

As far as I understand, there are a few possible scenarios; please correct me if I am wrong:

1. The updated doc has just been indexed and merged: delete_by_query simply skips it because of _sync_ts.
2. The doc is indexed but not merged (the old doc is marked as deleted, the new doc is waiting to appear in the index): delete_by_query fails on this doc.
3. The doc is not acknowledged at all: delete_by_query marks the doc with the old _sync_ts as deleted, and then the new doc is simply indexed separately, with no merge.

Am I right that delete_by_query fails in scenario (2)? Is there a better way to do the same thing without having to deal with docs that failed to delete?

Thanks,
Sergii

cbuescher · December 13, 2016, 5:12pm

Hi,

First a question: I'm confused about what you mean by "id". Is this the Elasticsearch "_id" document id or some field you are setting explicitly? In the first case, you can simply update the document without having to delete the old one; that should be done automatically.

Regarding the second part of your question: _delete_by_query gets a snapshot of the index when it starts and deletes what it finds using internal versioning. That means it is not concerned with marking or merging documents on the Lucene shard level. It also means that if the document changes between the time when the snapshot was taken and when the delete request is processed, you get a version conflict.
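To make the race concrete, here is a toy model in Python. It is not Elasticsearch code and does not reflect its actual implementation; the class `ToyIndex` and its methods are hypothetical and only mimic the "snapshot first, then delete by version" behavior described above.

```python
class ToyIndex:
    """Toy in-memory stand-in for an index with per-document versions."""

    def __init__(self):
        self.docs = {}  # _id -> (version, source)

    def index(self, doc_id, source):
        # Every write to the same _id bumps the document version.
        version = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (version, source)
        return version

    def snapshot(self, predicate):
        # Record (_id, version) for every doc matching the query right now.
        return [
            (doc_id, version)
            for doc_id, (version, source) in self.docs.items()
            if predicate(source)
        ]

    def delete_if_version(self, doc_id, expected_version):
        # Delete only if the doc is unchanged since the snapshot.
        current = self.docs.get(doc_id)
        if current is None or current[0] != expected_version:
            return False  # version conflict
        del self.docs[doc_id]
        return True


idx = ToyIndex()
idx.index("a", {"id": 1, "_sync_timestamp": 1})

# 1. Delete by query starts and takes its snapshot of matching docs.
hits = idx.snapshot(lambda s: s["id"] == 1 and s["_sync_timestamp"] < 2)

# 2. A concurrent index request updates the same doc.
idx.index("a", {"id": 1, "_sync_timestamp": 2})

# 3. The deletes run against the versions seen in the snapshot.
conflicts = sum(1 for doc_id, v in hits if not idx.delete_if_version(doc_id, v))
```

In this model the delete is rejected and the updated doc survives, which corresponds to the "failed" count reported by the request.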

Serg_Pilipenko · December 13, 2016, 9:07pm

Sorry for the confusing definitions. id doesn't mean Elasticsearch's _id; it's possible to have several docs in ES with the same id.

In my application I want ES to be in sync with the DB, and I have a sync function for this. sync is run from a queue, and after a sync(my_id) call a doc may change state from "needs to be indexed" to "needs to be deleted".

my_id represents one object in the application, but it may be represented by several objects in ES. The cheapest way to sync such a domain is to index all docs we can find in the DB by my_id, and then delete all old docs matching the my_id criteria.

"That means, if the document changes between the time when the snapshot was taken and when the delete request is processed, you get a version conflict."

What is the best practice for dealing with such a conflict? Perform the deletion and, if some docs failed, wait and perform it once more?
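The retry idea I have in mind could look roughly like this. It is only a sketch: `run_delete` is a hypothetical callable that issues the delete by query and returns the number of docs that failed to delete, and the attempt count and delay are arbitrary assumptions.

```python
import time


def delete_with_retry(run_delete, max_attempts=3, delay_seconds=1.0,
                      sleep=time.sleep):
    """Re-run a delete by query until no failures are reported.

    run_delete: hypothetical callable returning the number of failed docs.
    Returns (attempts_used, remaining_failures).
    """
    failed = 0
    for attempt in range(1, max_attempts + 1):
        failed = run_delete()
        if failed == 0:
            return attempt, 0
        if attempt < max_attempts:
            sleep(delay_seconds)  # give concurrent indexing time to settle
    return max_attempts, failed


# Usage with a fake run_delete that fails twice and then succeeds.
fake_results = iter([2, 1, 0])
attempts, remaining = delete_with_retry(
    lambda: next(fake_results),
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
```

Since the query excludes docs with _sync_timestamp gte the current value, re-running it should not touch freshly indexed docs, so repeating the request looks safe to me.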

system · January 10, 2017, 9:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
