Fastest way to delete a lot of individual documents


(jeroen1) #1

I've got an index with billions of documents. Once in a while I push an update. The update contains new documents as well as documents that replace older ones. I'd like to delete the documents that are replaced, typically a few million for each update.

From a technical point of view, the documents I'd like to delete are quite random: there is no single property, or simple combination of properties, that lets me select them all in one search.

The way I'm deleting now is pretty basic: I create a file with all entries to delete (so a few million lines) and feed it to a Bash script that calls cURL in a for loop, sending one _delete_by_query request per entry with a query like {"query":{"bool":{"must":[{"term":{"a":123}},{"term":{"b":"abc"}},{"term":{"c":19800101}},{"term":{"d":"a1b2c3"}}]}}}. It works OK, but it's not really fast...
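One way to cut down the number of requests with this approach might be to combine many entries into a single _delete_by_query body, wrapping each entry's term clauses in its own bool and OR-ing them with should. The sketch below only builds the request bodies; it sends nothing. The helper names, the batch size of 500, and the second entry are assumptions for illustration, and the batch size would need to stay within whatever clause limit the cluster enforces.

```python
import json


def build_batched_delete_query(entries):
    """Build one query body that matches any of the given entries.

    Each entry is a dict of field -> value; all fields of an entry must
    match (bool/must), and any one entry may match (bool/should).
    """
    should = [
        {"bool": {"must": [{"term": {field: value}} for field, value in entry.items()]}}
        for entry in entries
    ]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


entries = [
    {"a": 123, "b": "abc", "c": 19800101, "d": "a1b2c3"},
    {"a": 456, "b": "def", "c": 19900202, "d": "d4e5f6"},  # hypothetical second entry
]

# Batch size is an assumption; tune it against the cluster's clause limit.
for batch in chunked(entries, 500):
    body = build_batched_delete_query(batch)
    print(json.dumps(body))  # one request body per batch
```

Each printed body could then be sent as a single _delete_by_query request instead of one request per line of the input file.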

Any suggestions for a faster way of deleting individual documents in bulk?

For imports I'm currently using the third-party tool elasticdump, which uses the Bulk API. I'm looking for something similar that supports deleting entries based on an input file listing the documents to delete...

Thank you for your help!


(Thiago Souza) #2

Is there any reason you want to delete first and update afterwards? Technically, when you update a document the older version is marked as deleted anyway, so there is no real advantage in deleting it first; that would just mark it as deleted the same way the update does.


(jeroen1) #3

The update contains a new document that supersedes the old document's information, thereby making the old information legacy. It does not replace the old document in ES, so that's +1 record. The old document then goes to an archive elsewhere and needs to be deleted (-1 document, so +/-0 records in total), leaving the new document live and current.


(Thiago Souza) #4

I suppose you are not using the same id for old and new document. So the fastest way would be using the Bulk API.


(jeroen1) #5

I suppose you are not using the same id for old and new document.

Correct.

So the fastest way would be using the Bulk API.

How can I do that with documents that I cannot find in a single search? Can I use the Bulk API for deletes, with a file containing a lot of lines as input?

Thanks!


(Thiago Souza) #6

You can delete one document per line, but each line has to follow a specific format. Refer to the Bulk API documentation.
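As a rough illustration of that format: a bulk delete is a newline-delimited JSON body with one delete action per line, sent to the _bulk endpoint with Content-Type application/x-ndjson, and the body must end with a newline. The sketch below only builds such a body. It assumes the document ids are already known (for example, stored alongside the entries in the input file); the function name, index name, and ids are hypothetical, and older Elasticsearch versions may also expect a _type in each action.

```python
import json


def build_bulk_delete_body(index, doc_ids):
    """Return an NDJSON body with one delete action per document id."""
    lines = [
        json.dumps({"delete": {"_index": index, "_id": doc_id}})
        for doc_id in doc_ids
    ]
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical index name and ids, for illustration only.
body = build_bulk_delete_body("my-index", ["id-1", "id-2", "id-3"])
print(body, end="")
```

With millions of ids, the list would presumably be split into several such bodies, each sent as its own _bulk request.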

