Very large delete_by_query - supported?

Hi,

I'm trying to run a very large _delete_by_query, targeting ~1.5 billion documents in a single index, amounting to hundreds of GB of data. This is in the context of an overall ES cluster size of ~80 TB.

I'm running the delete as

POST /myindex/_delete_by_query?wait_for_completion=false

with the appropriate query in the body. I get the task ID back and the task runs for about 2.5 hours, at which point I see an error like the one below in the logs.

Is it possible to run such a large delete-by-query operation? Are there any flags I could supply to get this to work? Or do I have to manually break it down into smaller queries?

Any help appreciated, thanks!

Regards,

Adrian

Error:

[2020-09-08T11:18:54,753][INFO ][o.e.t.LoggingTaskListener] [myserver] 93149909 finished with response BulkByScrollResponse[took=2.3h,timed_out=false,sliceId=null,updated=0,created=0,deleted=11123000,batches=11123,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[{"index":"myindex","shard":4,"node":"ll_9Zw08QWe3xVGlxBtWCQ","status":429,"reason":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:data/read/search[phase/fetch/id/scroll]]would be [26738958650/24.9gb], which is larger than the limit of [26521423052/24.6gb], real usage: [26738954840/24.9gb], new bytes reserved: [3810/3.7kb], usages [request=0/0b, fielddata=212009452/202.1mb, in_flight_requests=3810/3.7kb, accounting=676743896/645.3mb]","bytes_wanted":26738958650,"bytes_limit":26521423052,"durability":"PERMANENT"}}]]

Can you split the query up somehow?
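One thing worth checking before splitting manually: _delete_by_query accepts `scroll_size` (the per-batch fetch size, default 1000) and `requests_per_second` (throttling) as query parameters, which can reduce heap pressure per fetch. A minimal sketch of building such a request; the host, index name, and query below are placeholder assumptions, and the actual HTTP call is left as a comment:

```python
# Sketch: build a throttled, batched _delete_by_query request.
# Host, index name, and the example query are assumptions for illustration.

def build_delete_by_query(index, query, scroll_size=500, requests_per_second=500):
    """Return (path, params, body) for an async, throttled delete-by-query."""
    path = f"/{index}/_delete_by_query"
    params = {
        "wait_for_completion": "false",              # run as a task, get a task ID back
        "scroll_size": scroll_size,                  # smaller batches -> less heap per fetch
        "requests_per_second": requests_per_second,  # throttle delete throughput
    }
    body = {"query": query}
    return path, params, body

path, params, body = build_delete_by_query(
    "myindex", {"range": {"@timestamp": {"lt": "2020-01-01"}}}
)
# e.g. requests.post("http://localhost:9200" + path, params=params, json=body)
```

Whether that is enough for 1.5 billion docs is another question, but it's the cheapest knob to try first.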

Hi Mark, thanks for responding. I can, if necessary, split the query into multiple deletes (using a time range). However, I was hoping to confirm whether I was missing a trick and ES could do this for me automatically somehow, or whether there is some other way ES could handle the large delete without a lot of manual effort on my side. Thanks!
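One thing I did find while digging: _delete_by_query accepts a `slices` parameter (including `slices=auto`), which splits the work, typically one slice per shard. It parallelizes the job rather than shrinking the per-fetch memory footprint, so I'm not sure it avoids the circuit breaker by itself. Manual slicing can be sketched like this; the helper name and example query are my own, for illustration:

```python
# Sketch: generate manually sliced delete-by-query bodies.
# Each slice is POSTed as its own independent task.
# The helper name and example query are assumptions for illustration.

def sliced_bodies(query, max_slices):
    """Yield one request body per slice of a manually sliced delete-by-query."""
    for slice_id in range(max_slices):
        yield {
            "slice": {"id": slice_id, "max": max_slices},
            "query": query,
        }

bodies = list(sliced_bodies({"match": {"status": "expired"}}, 5))
# POST each body to /myindex/_delete_by_query?wait_for_completion=false
```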

Your other option would be to do a reindex into a new index that only contains the docs you want, and slice that.
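To make that concrete, a sliced reindex that copies only the documents you want to keep would look roughly like this; the index names and keep-filter below are assumptions for illustration:

```python
# Sketch: body for a sliced _reindex that copies only the docs to keep.
# Index names and the keep-filter are assumptions for illustration;
# POST this to /_reindex?slices=auto&wait_for_completion=false.

def reindex_body(source_index, dest_index, keep_query):
    """Return a _reindex body that filters the source down to keep_query."""
    return {
        "source": {"index": source_index, "query": keep_query},
        "dest": {"index": dest_index},
    }

body = reindex_body(
    "myindex", "myindex-v2",
    {"range": {"@timestamp": {"gte": "2020-01-01"}}},  # docs to keep
)
```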

Thanks, but I'm not sure that would help: my understanding is that reindexing does not remove the documents from the original index, which is ultimately what I want to achieve. Or am I missing something? :slight_smile:

What I've done for now is write a script that time-slices the range I want to operate on hour by hour, and then executes a delete for each hour. However, this is far from ideal, as different hours can contain significantly different numbers of docs, and I need to take this into account when sleeping between delete requests (as I'm executing these as tasks and don't want to overwhelm the system with too many concurrent delete tasks). Maybe I'm over-complicating this! Perhaps a future enhancement to _delete_by_query could do some of this automatically?
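For anyone curious, the hour-by-hour approach boils down to something like the following; the timestamp field name, pacing, and HTTP call are assumptions/placeholders:

```python
# Sketch: time-slice a date range into hourly delete-by-query requests.
# Timestamp field, host, and pacing between tasks are assumptions.
from datetime import datetime, timedelta

def hourly_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) hour by hour."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(hours=1), end)
        yield cursor, nxt
        cursor = nxt

def delete_body(window_start, window_end, field="@timestamp"):
    """Range query deleting docs in [window_start, window_end)."""
    return {"query": {"range": {field: {
        "gte": window_start.isoformat(),
        "lt": window_end.isoformat(),
    }}}}

windows = list(hourly_windows(datetime(2020, 9, 1), datetime(2020, 9, 1, 6)))
# POST delete_body(s, e) to /myindex/_delete_by_query for each window,
# sleeping between submissions so tasks don't pile up.
```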

Thanks for the help anyway!

If you reindex into a new index, you can delete the old one and then add an alias to the new one.
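The final swap would be a delete of the old index followed by an aliases update, roughly like this; the index and alias names are assumptions for illustration:

```python
# Sketch: alias actions pointing clients at the new index.
# Index and alias names are assumptions for illustration;
# DELETE /myindex first, then POST this body to /_aliases.

def alias_actions(new_index, alias):
    """Return an _aliases body adding `alias` for `new_index`."""
    return {"actions": [{"add": {"index": new_index, "alias": alias}}]}

actions = alias_actions("myindex-v2", "myindex")
```

Clients keep using the old name, and the cutover is atomic as far as they can tell.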

Basically, you cannot escape the cost of what you want to do here. It's up to you to figure out what you are willing to accept :slight_smile: