Hi, first let's make some assumptions. We have an index containing 90,000,000 documents. Due to application logic we need to delete 30,000,000 docs (30% of the data). I know that the best solution in terms of performance would be to reindex the data that should not be deleted into a new index and delete the old one, but for the sake of this discussion let's exclude that option. So the other two options I see are:
Run an async delete-by-query (DBQ) with proper throttling so as not to overwhelm the cluster.
From the client application, fetch a batch of documents that meet the deletion criteria and send them to be deleted. Repeat until no documents matching the criteria are left, with proper throttling between batches.
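A minimal sketch of option 2, assuming an elasticsearch-py style client; the index name `my-index`, the `client_id` query, and the batch/pause values are placeholders, not something from a real setup:

```python
import time

# Hypothetical index and deletion criteria -- adapt to your own data model.
INDEX = "my-index"
QUERY = {"term": {"client_id": "former-client"}}
BATCH_SIZE = 5000      # docs fetched and deleted per round
PAUSE_SECONDS = 2      # throttle between batches


def make_delete_actions(index, ids):
    """Build bulk-API delete actions for the given document ids."""
    return [{"delete": {"_index": index, "_id": doc_id}} for doc_id in ids]


def delete_in_batches(es):
    """Repeatedly fetch a batch of matching doc ids and bulk-delete them.

    `es` is assumed to be an elasticsearch-py 8.x client instance.
    """
    while True:
        # Fetch ids only; _source is not needed just to delete.
        resp = es.search(index=INDEX, query=QUERY, size=BATCH_SIZE,
                         _source=False)
        hits = resp["hits"]["hits"]
        if not hits:
            break  # nothing left matching the criteria
        actions = make_delete_actions(INDEX, [h["_id"] for h in hits])
        es.bulk(operations=actions)
        time.sleep(PAUSE_SECONDS)  # give the cluster room to breathe
```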
Which one would be better in terms of cluster health and performance? I am mainly concerned about the initial snapshot of the data that is taken when DBQ is used, and how it will affect the cluster when an enormous volume of data needs to be deleted. Is this feature (DBQ) even designed for deleting such an amount of data? The time the operation takes to complete is irrelevant to me, but not damaging cluster performance is my priority.
The question is: what do you care about? Do you just need the docs gone, or do you want to reclaim the storage as well?
DBQ is the best in terms of performance, but if you want to reclaim the storage too you might want to go with reindexing into a new index and deleting the old index afterwards:
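For reference, a throttled async DBQ request could look roughly like this (the index name and query are placeholders; `requests_per_second` throttles the deletion rate, `wait_for_completion=false` runs it as a background task, and `conflicts=proceed` keeps it going past version conflicts):

```
POST /my-index/_delete_by_query?wait_for_completion=false&requests_per_second=500&conflicts=proceed
{
  "query": {
    "term": { "client_id": "former-client" }
  }
}
```

Run this way, the call returns a task id which you can poll with the tasks API (`GET _tasks/<task_id>`) to track progress.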
DBQ does not really delete the documents (segments in Elasticsearch are read-only); the documents are only marked as deleted.
To actually remove documents that are marked as deleted you would need to run a force merge. Please read the docs on force merge before running this operation.
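For example (the index name is a placeholder; `only_expunge_deletes` restricts the merge to expunging segments with deleted docs instead of merging the whole index down to few segments):

```
POST /my-index/_forcemerge?only_expunge_deletes=true
```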
Thanks @Wolfram_Haussig, I have familiarized myself with the provided docs. My main objective is to delete those documents, as this is part of removing a former client's data; however, reclaiming the storage afterwards would also be desirable. The docs state that "These soft-deleted documents are automatically cleaned up during regular segment merges.", so is a force merge required, or can I rely on automatic merging so that the storage is reclaimed after some time? In my current architecture this index still receives new documents (mentioning it since, as I can see in the docs, force merge is suggested only for read-only indices).
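As a side note, I assume that whether regular merges are reclaiming space over time can be watched via the deleted-docs count and store size reported by the cat indices API, along these lines (`my-index` is a placeholder):

```
GET _cat/indices/my-index?v&h=index,docs.count,docs.deleted,store.size
```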