Hi, first let's make some assumptions. We have an index containing 90,000,000 documents. Due to application logic we need to delete 30,000,000 docs (30% of the data). I know that the best solution in terms of performance would be to reindex the data that should not be deleted into a new index and delete the old one, but for the sake of this discussion let's exclude that option. So the other two options I see are:
Run an async delete-by-query (DBQ) with proper throttling so as not to overwhelm the cluster.
From the client application, fetch a batch of documents that meet the deletion criteria and send them to be deleted. Repeat until no documents matching the criteria are left, with proper throttling between batches.
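A minimal sketch of option 2, assuming an elasticsearch-py style client; the index name `my-index`, the `client_id` query, and the batch/pause values are placeholders, not something from a real setup:

```python
import time

# Hypothetical index and deletion criteria -- adapt to your own data model.
INDEX = "my-index"
QUERY = {"term": {"client_id": "former-client"}}
BATCH_SIZE = 5000      # docs fetched and deleted per round
PAUSE_SECONDS = 2      # throttle between batches


def make_delete_actions(index, ids):
    """Build bulk-API delete actions for the given document ids."""
    return [{"delete": {"_index": index, "_id": doc_id}} for doc_id in ids]


def delete_in_batches(es):
    """Repeatedly fetch a batch of matching doc ids and bulk-delete them.

    `es` is assumed to be an elasticsearch-py 8.x client instance.
    """
    while True:
        # Fetch ids only; _source is not needed just to delete.
        resp = es.search(index=INDEX, query=QUERY, size=BATCH_SIZE,
                         _source=False)
        hits = resp["hits"]["hits"]
        if not hits:
            break  # nothing left matching the criteria
        actions = make_delete_actions(INDEX, [h["_id"] for h in hits])
        es.bulk(operations=actions)
        time.sleep(PAUSE_SECONDS)  # give the cluster room to breathe
```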
Which one would be better in terms of cluster health and performance? I am mainly concerned about the initial snapshot of the data that is taken when DBQ is used, and how it will affect the cluster when an enormous volume of data needs to be deleted. Is this feature (DBQ) even designed for deleting such an amount of data? The time the operation takes to complete is irrelevant to me, but not damaging cluster performance is my priority.
The question is: what do you care about? Do you just need the docs gone, or do you want to reclaim the storage as well?
DBQ is the best in terms of performance, but if you want to reclaim the storage too you might want to go with reindexing into a new index and deleting the old index afterwards:
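For reference, a throttled async DBQ request could look roughly like this (the index name and query are placeholders; `requests_per_second` throttles the deletion rate, `wait_for_completion=false` runs it as a background task, and `conflicts=proceed` keeps it going past version conflicts):

```
POST /my-index/_delete_by_query?wait_for_completion=false&requests_per_second=500&conflicts=proceed
{
  "query": {
    "term": { "client_id": "former-client" }
  }
}
```

Run this way, the call returns a task id which you can poll with the tasks API (`GET _tasks/<task_id>`) to track progress.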
DBQ does not really delete the documents (segments in Elasticsearch are read-only); the documents are only marked as deleted.
To actually remove documents that are marked as deleted you would need to run a force merge. Please read the docs on force merge before running this operation.
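For example (the index name is a placeholder; `only_expunge_deletes` restricts the merge to expunging segments with deleted docs instead of merging the whole index down to few segments):

```
POST /my-index/_forcemerge?only_expunge_deletes=true
```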
Thanks @Wolfram_Haussig, I have familiarized myself with the provided docs. My main objective is to delete those documents, as this is part of removing a former client's data; however, reclaiming the storage afterwards would also be desirable. The docs state that "These soft-deleted documents are automatically cleaned up during regular segment merges.", so is a force merge required, or can I rely on automatic merging so that the storage is reclaimed after some time? In my current architecture this index still receives new documents (mentioning it since, as I can see in the docs, force merge is suggested only for read-only indices).
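As a side note, I assume that whether regular merges are reclaiming space over time can be watched via the deleted-docs count and store size reported by the cat indices API, along these lines (`my-index` is a placeholder):

```
GET _cat/indices/my-index?v&h=index,docs.count,docs.deleted,store.size
```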