Delete by Query and Refresh Interval

gavinlee · October 10, 2020, 8:52am

Hi, I have a situation where I take the following steps:

retrieve lock
delete by query
refresh
insert documents
release lock

These steps can happen in quick succession. The delete by query is supposed to find all the documents previously inserted and delete them. For the most part, this has been working fine. However, I just discovered an incident where, after these steps completed, there were more documents than expected. I haven't been able to replicate this, but I came up with the following hypothesis.

Consider the following example with refresh interval set to 1s:

Iteration 1:

retrieve lock
delete by query (nothing to delete)
refresh
insert documents with IDs 1, 2, 3, all with a property X with value of Y
release lock

Immediately after (less than 1s), Iteration 2 occurs:

retrieve lock
delete by query for all documents with property X with value of Y
refresh
insert documents with IDs 2, 3, 4, all with a property X with value of Y
release lock

The final state I expected was only documents with IDs 2, 3, 4, but I ended up with documents with IDs 1, 2, 3, 4.

Of course there may be other issues with my code, but my suspicion is this: no refresh was performed between the insert in Iteration 1 and the delete by query in Iteration 2, and the 1s refresh interval had not yet elapsed, so the delete by query failed to find and delete the documents with IDs 1, 2 and 3. Then, when the insert phase of Iteration 2 happens, IDs 1, 2 and 3 still exist, so IDs 2 and 3 are updated and ID 4 is inserted.

Is this a possible explanation for my unexpected state?

Something that concerns me is that this hypothesis seems to contradict the following post: "Elasticsearch delete_by_query 409 version conflict", where a 409 error is expected instead. I did not receive any 409 errors in the example provided. Also, that post mentions that the 409 error is due to the lack of a refresh between the insert and the delete by query. But if there's no refresh, the documents cannot be searched, so how can the delete by query find the document to delete in the first place?

Any insight would be greatly appreciated. Thanks!

gavinlee · October 13, 2020, 12:47am

For more info, the insert step is completed using Bulk Processor.

flash1293 · October 13, 2020, 7:38am

Delete by query only works for indexed documents (because it relies on the index to find the documents) - so you need to make sure the index is up to date before using the next delete by query step.

Why are you refreshing before inserting the documents? It seems like by moving refresh after insertion your problem could be solved.
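To make the difference between the two orderings concrete, here is a toy in-memory model of the hypothesis from the first post. It is only a sketch and not how Elasticsearch is implemented: the `ToyIndex` class, its `live` and `visible` dictionaries, and the `run` helper are all made-up names. The only behavior it assumes is that a search (and therefore a delete by query) sees the index as of the last refresh, while writes by ID always apply to the current state.

```python
class ToyIndex:
    """Toy model: writes land in `live`; searches only see `visible`,
    which is a copy of `live` taken at the last refresh."""

    def __init__(self):
        self.live = {}      # every document written so far, keyed by ID
        self.visible = {}   # what a search sees (state at the last refresh)

    def refresh(self):
        self.visible = dict(self.live)

    def index(self, doc_id, doc):
        self.live[doc_id] = doc  # create, or overwrite an existing ID

    def delete_by_query(self, field, value):
        # The query runs against the searchable view only.
        hits = [i for i, d in self.visible.items() if d.get(field) == value]
        for doc_id in hits:
            self.live.pop(doc_id, None)
        return len(hits)


def run(refresh_after_insert):
    """Run Iteration 1 and Iteration 2 back to back, with no periodic refresh in between."""
    idx = ToyIndex()
    for ids in ([1, 2, 3], [2, 3, 4]):
        idx.delete_by_query("X", "Y")
        if not refresh_after_insert:
            idx.refresh()            # original order: refresh before insert
        for doc_id in ids:
            idx.index(doc_id, {"X": "Y"})
        if refresh_after_insert:
            idx.refresh()            # suggested order: refresh after insert
    idx.refresh()  # let the periodic refresh catch up before inspecting
    return sorted(idx.visible)


original_order = run(refresh_after_insert=False)
suggested_order = run(refresh_after_insert=True)
```

In this model the original order leaves IDs 1, 2, 3, 4 behind, because the second delete by query runs against a view that does not yet contain the first batch. With the refresh moved after the insert, only IDs 2, 3, 4 remain.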

gavinlee · October 14, 2020, 6:15am

Thanks for the response! That makes sense, I agree that moving the refresh after insertion should resolve the problem.

My only open question is why the 409 error occurs in the discussion here: "Elasticsearch delete_by_query 409 version conflict".

My expectation is that delete by query would just find no documents to delete, since a refresh has not occurred since the documents were inserted. How can a 409 error occur if the delete by query only finds indexed documents? (I'm assuming by "indexed" you mean that a refresh has occurred after the insertion.) From that discussion, it seems the 409 error happens because the delete by query somehow finds an older version of the document that was just inserted. I'm probably missing something, but the document doesn't exist before it is indexed, so how can there be a version conflict?

The only explanation I can think of is that the original poster meant that an update document request was sent instead of a create document request. Then delete by query would find the old version of the document and attempt to delete that, and a 409 error would be returned.
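The update scenario in the last paragraph can be sketched with another toy model. It rests on what the delete by query documentation describes: the request works from a snapshot of the index taken when it starts and deletes using internal versioning, so a document that changed after the snapshot produces a version conflict. The names `VersionConflict` and `delete_by_query_toy`, and the use of a plain integer as a sequence number, are assumptions for illustration only.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 response in this toy model."""


def delete_by_query_toy(visible, live, field, value):
    """Delete every hit from `live`, but only if it is unchanged since the snapshot.

    `visible` and `live` map doc ID -> (seq_no, source). `visible` is the
    snapshot the query runs against; `live` is the current state.
    """
    for doc_id, (seen_seq, source) in visible.items():
        if source.get(field) != value:
            continue
        current_seq, _ = live[doc_id]
        if current_seq != seen_seq:
            raise VersionConflict(
                f"doc {doc_id}: saw seq {seen_seq}, now {current_seq}"
            )
        del live[doc_id]


# Document 1 was indexed and refreshed, so it is searchable at seq_no 0.
visible = {1: (0, {"X": "Y"})}
live = dict(visible)

# It is then overwritten (an update, not a create) with no refresh afterwards.
live[1] = (1, {"X": "Y"})

try:
    delete_by_query_toy(visible, live, "X", "Y")
    outcome = "deleted"
except VersionConflict:
    outcome = "409"
```

Here the delete is rejected because the searchable copy is older than the current one. A document that was created but never refreshed is not in `visible` at all, so it is simply not found and no conflict is raised, which matches the case with no 409 errors described at the top of the thread.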

flash1293 · October 14, 2020, 7:31am

Ah, I see - it can lead to a problem if the "delete by query" and the document insertion happen in parallel and overlap partially. If the timing is unlucky, it could lead to some of the newly inserted documents being deleted again right away (that's what the 409 is telling you).

To be really safe, you need to make sure the index is in a consistent state after both inserting and deleting. For insertion, you can do this by adding refresh=wait_for (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html) to the request - this makes sure the request only completes after the index is updated.
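As a rough illustration of where the parameter goes, the sketch below builds the path and newline-delimited body for a bulk request with refresh=wait_for. It only constructs strings and sends nothing. The helper name `build_bulk_request` and the index name `my-index` are made up, and how to set the equivalent option through a particular client library is not covered here.

```python
import json


def build_bulk_request(index, docs, refresh="wait_for"):
    """Return (path, body) for a bulk request that indexes `docs` by ID.

    `docs` maps document ID -> source dict. With refresh=wait_for the
    response is held back until a refresh has made the changes searchable.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))  # action line
        lines.append(json.dumps(source))                           # document line
    body = "\n".join(lines) + "\n"  # a bulk body must end with a newline
    path = f"/{index}/_bulk?refresh={refresh}"
    return path, body


path, body = build_bulk_request(
    "my-index",  # hypothetical index name
    {2: {"X": "Y"}, 3: {"X": "Y"}, 4: {"X": "Y"}},
)
```

The resulting `path` and `body` would be sent as a POST with a newline-delimited JSON content type.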

gavinlee · October 16, 2020, 7:26am

Thanks Joe!

If my pattern is always:

1. Delete by query
2. Bulk insert
3. Refresh
4. Repeat

Then would adding a Refresh between steps 1 and 2 help in any way?

Christian_Dahlqvist · October 16, 2020, 8:54am

Delete by query adds tombstone records, so I would expect a refresh after this phase to help.
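Put together, one cycle with a refresh after each write phase could look like the sequence below. This is a sketch under assumptions, not a tested recipe: `plan_cycle` is a made-up helper that only lists the REST calls in order, the index name is hypothetical, and the term query assumes that property X is mapped so that an exact match on Y works (for example as a keyword field).

```python
import json


def plan_cycle(index, field, value, bulk_body):
    """Return the (method, path, body) calls for one delete/insert cycle."""
    query = json.dumps({"query": {"term": {field: value}}})
    return [
        ("POST", f"/{index}/_delete_by_query", query),
        ("POST", f"/{index}/_refresh", None),    # make the deletes visible to search
        ("POST", f"/{index}/_bulk", bulk_body),
        ("POST", f"/{index}/_refresh", None),    # make the new docs visible to the next cycle
    ]


steps = plan_cycle("my-index", "X", "Y", bulk_body="...bulk NDJSON here...")
for method, path, _ in steps:
    print(method, path)
```

The final refresh is the one the next cycle's delete by query depends on. The refresh after the delete follows the suggestion above about tombstones.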

system · November 13, 2020, 8:54am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
