How can I delete everything except for top N documents?

Fritz · November 26, 2015, 12:44am

Hi folks,

I'm trying to delete all documents except for the top N sorted by some criteria, and I can't quite figure out how. I can retrieve those top N documents with a query using sort, from and size. We're using Elasticsearch 1.6.

I've tried:

1. Delete by query. However, I get the error: request does not support [sort]. I couldn't find any documentation saying that the "sort" parameter is not supported in delete by query.
2. A scroll. Since delete by query is deprecated in later releases anyway, I tried to use a scroll so I can feed the results into a bulk delete. However, the scroll does not seem to support starting at N.

So I guess the only reasonable solution is to use a cursor and simply discard the first N items.
Any other solutions? Will the Delete By Query plugin in 2.0 support "sort" and "from"?

Thanks
Fritz

warkolm · November 27, 2015, 6:40am

Delete by query is unlikely to support any sorting.

There's no real easy way to do this, so my question is, why are you even doing it to begin with?

Fritz · November 28, 2015, 12:30am

The "top N documents" are the requirements of the application I am building; I am trying to prototype that in Elasticsearch. That initial question eventually morphed into my trying to figure out how to delete documents properly in general.

It turns out that the solution I was speculating about at the end of my prior email worked like a charm (with a bit of client coding). I'm not sure about the performance, but for low document counts it works just fine.

1. Open a scroll on a query sorted by the required criteria (in my case a date/time field) and set a page size that makes sense.
2. Have the query return only the basic document properties, such as _type and _id, not the actual documents, which are irrelevant here.
3. Use the scroll to retrieve page after page of results.
4. Keep skipping documents (or whole pages) until the desired number of documents has been skipped.
5. Once skipping is done, save the _type and _id of every remaining document in a list.
6. When the scroll is complete, build a bulk request that deletes all documents saved above by their type and id.
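
The steps above can be sketched roughly as follows in Python. This is only a sketch of the client-side part (skipping and building the bulk body), not the exact code used here. The function names hits_to_delete and bulk_delete_body are made up for illustration, and it assumes each scroll page has already been fetched as a list of hit dictionaries carrying _index, _type and _id. Including _index in each action is an assumption; it can be dropped if the bulk request is sent to an index-specific URL.

```python
import json


def hits_to_delete(pages, keep):
    """Skip the first `keep` hits across all scroll pages and return the rest.

    `pages` is any iterable of pages, each page a list of hit dicts
    (assumed to contain "_index", "_type" and "_id").
    """
    doomed = []
    skipped = 0
    for page in pages:
        for hit in page:
            if skipped < keep:
                # Still inside the top N: these documents are kept.
                skipped += 1
                continue
            # Only the identifying metadata is remembered, not the document.
            doomed.append(
                {"_index": hit["_index"], "_type": hit["_type"], "_id": hit["_id"]}
            )
    return doomed


def bulk_delete_body(hits):
    """Build a newline-delimited bulk body with one delete action per hit."""
    lines = [json.dumps({"delete": hit}) for hit in hits]
    # The bulk format expects every line, including the last, to end in "\n".
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be sent as the body of a single _bulk request after the scroll has finished, which matches the "delete after the whole cursor is processed" approach.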

If I refresh the index at the right points I can verify that the count of items is as expected before and after the delete.

This assumes a couple of things, to simplify any consistency issues that might arise:

1. No documents that would match the delete query are indexed while the delete is being performed.
2. The delete is performed only after the whole cursor has been processed. I was tempted to delete after each page, which would reduce the number of document _ids that I have to cache, but deleting at the end seemed the safer alternative, at least for this prototype.

Cheers
Fritz

warkolm · November 28, 2015, 12:33am

I'd find the top N documents, put them into a new index and then delete the old one.
This is much simpler and you can use aliases to make things transparent for your application.
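
As a rough illustration of the alias part, the request body for an alias swap via the _aliases endpoint could be built like this. The alias and index names (events, events_v1, events_v2) and the helper name alias_swap_actions are hypothetical, and copying the top N documents into the new index is not shown.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases request body that moves `alias` from old to new index."""
    return {
        "actions": [
            # Both actions go in one request so the alias is switched in one step.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
print(json.dumps(alias_swap_actions("events", "events_v1", "events_v2"), indent=2))
```

The application keeps querying the alias, so it would not need to know which physical index is behind it.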

Fritz · November 28, 2015, 3:28am

That would be possible too, however there are many other documents in that index which do not match that delete query.

I find that a lot of proposed solutions to harder Elasticsearch problems hover around the idea that you should create another index. That, of course, is a legitimate solution, but in many cases it would lead to an explosion of micro-indexes, which would then have to be managed somehow and aggregated for larger queries. At this point I find it easier to limit the number of indexes and manage the data by creating multiple types within an index. However, it looks like even that freedom is going away in 2.0, where fields with the same name in different types must have the same data type.
