Hi,
I want to delete over 10 million documents.
Which approach is better and faster: looping and deleting each document with DELETE //_doc/<_id>, or running a delete by query (_delete_by_query)?
Thanks...
Are they all part of the same index?
How many documents do you have in total?
Index size is 50GB.
They can be in 1-2 indexes.
I did not mean the size on disk but the number of documents.
What is the output of:
GET /_cat/indices?v
I do not think there is much difference. Behind the scenes, delete by query runs a query to get the IDs, then takes these and sends bulk delete requests by ID. This is probably what you would be doing anyway. It will reduce network round trips but may use lower concurrency, so it could go either way.
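To make the "bulk delete requests by ID" part concrete, here is a rough Python sketch of the request body you would build if you did this client-side. The helper names (`build_bulk_delete_body`, `chunked`), the index name `my-index`, and the batch size are made up for illustration. The only thing taken from Elasticsearch is the `_bulk` NDJSON format, with one `delete` action per line.

```python
import json


def build_bulk_delete_body(index, doc_ids):
    """Return the NDJSON payload for a POST /_bulk request deleting doc_ids.

    `index` and `doc_ids` are supplied by the caller; nothing is sent here.
    """
    if not doc_ids:
        return ""
    lines = [
        json.dumps({"delete": {"_index": index, "_id": doc_id}})
        for doc_id in doc_ids
    ]
    # The bulk API expects one action per line and a trailing newline.
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    ids = ["a1", "a2", "a3"]  # hypothetical document IDs
    for batch in chunked(ids, 2):
        print(build_bulk_delete_body("my-index", batch), end="")
```

Each batch would be sent as the body of `POST /_bulk` with `Content-Type: application/x-ndjson`. With 10 million IDs, the batch size and the number of parallel senders are what you would tune.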
green open base_elements_rwr-2022.06.23-000020 JpzGSsMzR5-zAMdAi3Qp9w 1 0 73474700 0 16gb 16gb
Does the delete impact write performance?
What is the recommended approach to delete documents (over 10 million) without impacting write/read performance?
Do I have to reindex after each delete?
If I run the delete with wait_for_completion=false, should I delete the task manually after it has completed? If yes, how (it is a system index)?
Deletes in Elasticsearch are basically soft deletes: a tombstone record is created and the original data is only removed later, during merging. A delete therefore acts almost like an update and results in disk I/O and additional merging activity, which can affect read and write performance.
The most efficient way to delete data from Elasticsearch is to delete complete indices. This is why time-based indices are often used for immutable data like logs. If you are deleting data based on age, this may be an option. If you have more complex criteria, there is no way to delete without potentially affecting performance in some way. You can do it slowly and spread out the load, but that is it.
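If deleting by age is an option, picking whole indices to drop could look roughly like the sketch below. It assumes index names of the form `<prefix>-YYYY.MM.DD-NNNNNN`, like the one in the `_cat/indices` output above. The function name `indices_older_than` and the cutoff date are hypothetical. Note that the date in a rollover-style name is when the index was created, so the index can hold documents newer than that date; check that before dropping anything.

```python
import re
from datetime import date

# Assumed name shape: <prefix>-YYYY.MM.DD-NNNNNN (an assumption, not a rule).
INDEX_DATE = re.compile(r"-(\d{4})\.(\d{2})\.(\d{2})-\d+$")


def indices_older_than(index_names, cutoff):
    """Return the names whose embedded date is strictly before `cutoff`."""
    old = []
    for name in index_names:
        match = INDEX_DATE.search(name)
        if match is None:
            continue  # name does not follow the assumed pattern
        year, month, day = map(int, match.groups())
        try:
            created = date(year, month, day)
        except ValueError:
            continue  # digits matched but do not form a real date
        if created < cutoff:
            old.append(name)
    return old


if __name__ == "__main__":
    names = ["base_elements_rwr-2022.06.23-000020"]
    print(indices_older_than(names, date(2022, 7, 1)))
```

Each returned name can then be removed with `DELETE /<index-name>`, which drops the index outright instead of writing a tombstone per document.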
No, but you may want to run a refresh if you have modified the refresh interval.
No.
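For the "do it slowly" route, a throttled background delete by query could be assembled roughly like this. It is only a sketch: the helper names, the base URL, the `@timestamp` field, and the value 200 for `requests_per_second` are assumptions for illustration. The query parameters `wait_for_completion`, `requests_per_second` and `conflicts` are the ones the `_delete_by_query` API accepts. The code only builds the request and does not send anything.

```python
import json
from urllib.parse import urlencode


def build_delete_by_query(base_url, index, query, requests_per_second=500):
    """Return (method, url, body) for a throttled, asynchronous delete by query."""
    params = {
        "wait_for_completion": "false",  # return a task ID immediately
        "requests_per_second": requests_per_second,  # throttle the deletes
        "conflicts": "proceed",  # do not abort on version conflicts
    }
    url = f"{base_url}/{index}/_delete_by_query?{urlencode(params)}"
    body = json.dumps({"query": query})
    return "POST", url, body


def task_status_url(base_url, task_id):
    """Return the URL for polling the task ID that the request above returns."""
    return f"{base_url}/_tasks/{task_id}"


if __name__ == "__main__":
    # Hypothetical criteria: documents older than 30 days by @timestamp.
    old_docs = {"range": {"@timestamp": {"lt": "now-30d"}}}
    method, url, body = build_delete_by_query(
        "http://localhost:9200", "my-index", old_docs, requests_per_second=200
    )
    print(method, url)
    print(body)
```

The response to that request contains a `task` value, and `GET /_tasks/<task_id>` reports progress while it runs. A lower `requests_per_second` spreads the I/O and merge load over a longer period.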