Is Delete by query a clean operation?

Hi Everyone,

I wanted to know whether delete by query is a clean API. Are the resources (memory or I/O) released gracefully once the query completes, or does it keep holding some that might affect performance?

All resources are released when no longer needed (at least, if they aren't then that's a bug).

Thanks for the reply @DavidTurner. I had one more question related to delete by query: How much time does it take to execute, let's say on some given size of data?

That depends on so many details of the data and the cluster configuration that it's basically impossible to answer. You should benchmark it on a representative sample of your data.
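A minimal sketch of such a benchmark, assuming a cluster reachable over plain HTTP and a throwaway copy of your data. It calls the `_delete_by_query` endpoint with `conflicts=proceed` and `wait_for_completion=true`, then reports wall-clock time next to the `took` and `deleted` fields from the response. The helper names (`build_delete_by_query_request`, `timed`, `run_benchmark`) are made up for illustration, and authentication and TLS are left out.

```python
import json
import time
import urllib.request


def build_delete_by_query_request(base_url, index, query):
    """Build a POST request for the _delete_by_query endpoint."""
    # conflicts=proceed continues past version conflicts instead of aborting;
    # wait_for_completion=true makes the call block until the delete finishes.
    url = f"{base_url}/{index}/_delete_by_query?conflicts=proceed&wait_for_completion=true"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_benchmark(base_url, index, query):
    """Send the delete and return wall-clock time plus the server's counters."""
    request = build_delete_by_query_request(base_url, index, query)

    def send():
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    result, elapsed = timed(send)
    # "took" (milliseconds) and "deleted" are fields of the response body.
    return {
        "elapsed_s": elapsed,
        "took_ms": result.get("took"),
        "deleted": result.get("deleted"),
    }
```

You would call it as something like `run_benchmark("http://localhost:9200", "sample-index", {"range": {"timestamp": {"lt": "now-30d"}}})`, where the URL, index name, and field are placeholders. Run it while the cluster is under its usual indexing and search load, otherwise the number will look better than what you get in production.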

@DavidTurner do you have any sample use case based on your experience? I can use that for a guesstimate.

Elasticsearch relies on Lucene, which uses immutable segments to store data. Deleting a document therefore does not happen in place. With respect to performance, a delete is similar to an update: a tombstone record needs to be written to a new segment, replacing the deleted document. The actual document is only removed later, when segments are merged. Deleting lots of documents can therefore be very resource intensive and take a while to run.
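To make the tombstone-then-merge behaviour concrete, here is a toy model in Python. It is a deliberate simplification and not Lucene's actual on-disk format or API; `Segment` and `ToyIndex` are hypothetical names. The point it illustrates is that a delete only adds a marker, and the stored document count does not drop until a merge rewrites the segments.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """An immutable unit: either a batch of documents or a batch of tombstones."""
    docs: dict = field(default_factory=dict)
    tombstones: frozenset = frozenset()


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        # New documents always go into a new segment.
        self.segments.append(Segment(docs=dict(docs)))

    def delete(self, doc_ids):
        # A delete never touches existing segments; it writes tombstones
        # into a new segment instead.
        self.segments.append(Segment(tombstones=frozenset(doc_ids)))

    def live_docs(self):
        # Replay segments in order: later tombstones hide earlier documents.
        live = {}
        for segment in self.segments:
            live.update(segment.docs)
            for doc_id in segment.tombstones:
                live.pop(doc_id, None)
        return live

    def stored_doc_count(self):
        # Documents physically held in segments, including deleted ones.
        return sum(len(segment.docs) for segment in self.segments)

    def merge(self):
        # Merging rewrites the live documents into one segment, which is
        # the step that actually drops the deleted documents.
        self.segments = [Segment(docs=self.live_docs())]


index = ToyIndex()
index.add({1: "a", 2: "b", 3: "c"})
index.delete([2])
visible_before_merge = sorted(index.live_docs())
stored_before_merge = index.stored_doc_count()
index.merge()
stored_after_merge = index.stored_doc_count()
```

In this model the deleted document stops being visible immediately, but the space it occupies is only reclaimed by `merge()`, which is where much of the cost of a large delete shows up.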

How much data do you have? How much do you delete each time? Are you planning on running multiple deletes in parallel? How frequently are you inserting or updating data apart from the deletions?

Hi @Christian_Dahlqvist to answer your questions:

How much data do you have?
Ans - 50 GB in 1 Index (1 Shard per index)

How much do you delete each time?
Ans - 10% ~ 5 GB of Data

Are you planning on running multiple deletes in parallel?
Ans - Yes, around 1000 parallel delete operations will be running to delete 5 GB of data.

I suspect that will be a problem and would recommend you test/benchmark to make sure.

@Christian_Dahlqvist How much time will it take approximately if we trigger 1 query which deletes 5 GB of data from an index of 50 GB?

It depends on your cluster and data, on what other load the cluster is under, and on the query itself, so you need to test. If you are looking for an order of magnitude, it may be similar to how long it would take to update the same amount of data. Deleting a lot of data this way can take a long time.
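If you do measure a delete (or an update) on a sample, a rough way to turn that into an order-of-magnitude figure is to scale the measured throughput. The sketch below assumes throughput stays roughly constant as the volume grows, which is an assumption and may not hold under merging or concurrent load. `estimate_seconds` is a hypothetical helper, not part of any Elasticsearch client.

```python
def estimate_seconds(sample_docs, sample_seconds, target_docs):
    """Linearly extrapolate a measured run to a larger document count."""
    if sample_docs <= 0 or sample_seconds <= 0:
        raise ValueError("sample measurements must be positive")
    # Documents processed per second in the measured sample run.
    throughput = sample_docs / sample_seconds
    return target_docs / throughput
```

Treat the result as a lower bound for planning rather than a prediction, and measure again at a larger sample size to see whether the linear assumption still holds.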
