Very large delete_by_query - supported?

Hi,

I'm trying to run a very large _delete_by_query, targeting ~1.5 billion documents in a single index, amounting to hundreds of GB of data. This is in the context of an overall ES cluster size of ~80 TB.

I'm running the delete as

POST /myindex/_delete_by_query?wait_for_completion=false

with the appropriate query in the body. I get the task ID back and the task runs for about 2.5 hours, at which point I see an error like the one below in the logs.

Is it possible to run such a large delete-by-query operation? Are there any flags I could supply to get this to work? Or do I have to manually break it down into smaller queries?

Any help appreciated, thanks!

Regards,

Adrian

Error:

[2020-09-08T11:18:54,753][INFO ][o.e.t.LoggingTaskListener] [myserver] 93149909 finished with response BulkByScrollResponse[took=2.3h,timed_out=false,sliceId=null,updated=0,created=0,deleted=11123000,batches=11123,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[{"index":"myindex","shard":4,"node":"ll_9Zw08QWe3xVGlxBtWCQ","status":429,"reason":{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [indices:data/read/search[phase/fetch/id/scroll]]would be [26738958650/24.9gb], which is larger than the limit of [26521423052/24.6gb], real usage: [26738954840/24.9gb], new bytes reserved: [3810/3.7kb], usages [request=0/0b, fielddata=212009452/202.1mb, in_flight_requests=3810/3.7kb, accounting=676743896/645.3mb]","bytes_wanted":26738958650,"bytes_limit":26521423052,"durability":"PERMANENT"}}]]

Can you split the query up somehow?
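One thing worth checking before splitting manually: _delete_by_query accepts `scroll_size` (the per-batch fetch size, default 1000) and `requests_per_second` (throttling) as query parameters, which can reduce heap pressure per fetch. A minimal sketch of building such a request; the host, index name, and query below are placeholder assumptions, and the actual HTTP call is left as a comment:

```python
# Sketch: build a throttled, batched _delete_by_query request.
# Host, index name, and the example query are assumptions for illustration.

def build_delete_by_query(index, query, scroll_size=500, requests_per_second=500):
    """Return (path, params, body) for an async, throttled delete-by-query."""
    path = f"/{index}/_delete_by_query"
    params = {
        "wait_for_completion": "false",              # run as a task, get a task ID back
        "scroll_size": scroll_size,                  # smaller batches -> less heap per fetch
        "requests_per_second": requests_per_second,  # throttle delete throughput
    }
    body = {"query": query}
    return path, params, body

path, params, body = build_delete_by_query(
    "myindex", {"range": {"@timestamp": {"lt": "2020-01-01"}}}
)
# e.g. requests.post("http://localhost:9200" + path, params=params, json=body)
```

Whether that is enough for 1.5 billion docs is another question, but it's the cheapest knob to try first.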

Hi Mark, thanks for responding. I can, if necessary, split the query into multiple deletes (using a time range). However, I was hoping to confirm whether I was missing a trick and ES could do this for me automatically somehow, or whether there is some other way ES could handle the large delete without a lot of manual effort on my side. Thanks!
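One thing I did find while digging: _delete_by_query accepts a `slices` parameter (including `slices=auto`), which splits the work, typically one slice per shard. It parallelizes the job rather than shrinking the per-fetch memory footprint, so I'm not sure it avoids the circuit breaker by itself. Manual slicing can be sketched like this; the helper name and example query are my own, for illustration:

```python
# Sketch: generate manually sliced delete-by-query bodies.
# Each slice is POSTed as its own independent task.
# The helper name and example query are assumptions for illustration.

def sliced_bodies(query, max_slices):
    """Yield one request body per slice of a manually sliced delete-by-query."""
    for slice_id in range(max_slices):
        yield {
            "slice": {"id": slice_id, "max": max_slices},
            "query": query,
        }

bodies = list(sliced_bodies({"match": {"status": "expired"}}, 5))
# POST each body to /myindex/_delete_by_query?wait_for_completion=false
```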

Your other option would be to do a reindex into a new index that only contains the docs you want, and slice that.
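To make that concrete, a sliced reindex that copies only the documents you want to keep would look roughly like this; the index names and keep-filter below are assumptions for illustration:

```python
# Sketch: body for a sliced _reindex that copies only the docs to keep.
# Index names and the keep-filter are assumptions for illustration;
# POST this to /_reindex?slices=auto&wait_for_completion=false.

def reindex_body(source_index, dest_index, keep_query):
    """Return a _reindex body that filters the source down to keep_query."""
    return {
        "source": {"index": source_index, "query": keep_query},
        "dest": {"index": dest_index},
    }

body = reindex_body(
    "myindex", "myindex-v2",
    {"range": {"@timestamp": {"gte": "2020-01-01"}}},  # docs to keep
)
```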

Thanks, but I'm not sure that would help: my understanding is that reindexing does not remove the documents from the original index, which is ultimately what I want to achieve. Or am I missing something? :slight_smile:

What I've done for now is write a script that time-slices the range I want to operate on hour by hour, and then executes a delete for each hour. However, this is far from ideal, as different hours can contain significantly different numbers of docs, and I need to take this into account when sleeping between delete requests (as I'm executing these as tasks and don't want to overwhelm the system with too many concurrent delete tasks). Maybe I'm over-complicating this! Perhaps a future enhancement to _delete_by_query could do some of this automatically?
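For anyone curious, the hour-by-hour approach boils down to something like the following; the timestamp field name, pacing, and HTTP call are assumptions/placeholders:

```python
# Sketch: time-slice a date range into hourly delete-by-query requests.
# Timestamp field, host, and pacing between tasks are assumptions.
from datetime import datetime, timedelta

def hourly_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) hour by hour."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(hours=1), end)
        yield cursor, nxt
        cursor = nxt

def delete_body(window_start, window_end, field="@timestamp"):
    """Range query deleting docs in [window_start, window_end)."""
    return {"query": {"range": {field: {
        "gte": window_start.isoformat(),
        "lt": window_end.isoformat(),
    }}}}

windows = list(hourly_windows(datetime(2020, 9, 1), datetime(2020, 9, 1, 6)))
# POST delete_body(s, e) to /myindex/_delete_by_query for each window,
# sleeping between submissions so tasks don't pile up.
```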

Thanks for the help anyway!

If you reindex into a new index, you can delete the old one and then add an alias to the new one.
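The final swap would be a delete of the old index followed by an aliases update, roughly like this; the index and alias names are assumptions for illustration:

```python
# Sketch: alias actions pointing clients at the new index.
# Index and alias names are assumptions for illustration;
# DELETE /myindex first, then POST this body to /_aliases.

def alias_actions(new_index, alias):
    """Return an _aliases body adding `alias` for `new_index`."""
    return {"actions": [{"add": {"index": new_index, "alias": alias}}]}

actions = alias_actions("myindex-v2", "myindex")
```

Clients keep using the old name, and the cutover is atomic as far as they can tell.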

Basically, you cannot escape the cost of what you want to do here. It's up to you to figure out what you are willing to accept :slight_smile: