The "top N documents" behavior is a requirement of the application I am building, and I am trying to prototype it in Elasticsearch. That initial question eventually morphed into figuring out how to delete documents properly in general.
It turns out that the solution I was speculating about at the end of my prior email worked like a charm (with a bit of client coding). I'm not sure about the performance, but for low document counts it works just fine. The steps are:
- Get a scroll on a query sorted by the required criteria (in my case a date/time field) and set a page size that makes sense
- Have the query retrieve only the basic document properties, such as _type and _id, not the actual documents, which are irrelevant here
- Use the scroll to retrieve page after page of results
- Keep skipping documents (or whole pages) until we have skipped the desired number of documents.
- Once we are no longer skipping, save the _type and _id of every remaining document in a list
- When the scroll is complete, build a bulk request that deletes all the documents collected above by their type and id
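
The skip-then-collect part of the steps above can be sketched roughly as follows. This is only an illustration in Python, not my actual client code: `collect_refs_to_delete`, `build_bulk_delete_body`, the `events` index name and the sample hits are all made up for the example, and `pages` stands in for whatever the scroll returns page by page (fetching the pages is left out). The one thing taken from Elasticsearch itself is the bulk body layout: one `{"delete": {...}}` action per line, newline-terminated.

```python
import json
from typing import Dict, Iterable, List, Tuple

Hit = Dict[str, str]          # a scroll hit reduced to its _type and _id
DocRef = Tuple[str, str]      # (_type, _id)


def collect_refs_to_delete(pages: Iterable[List[Hit]], keep: int) -> List[DocRef]:
    """Skip the first `keep` hits across all pages, collect the rest."""
    skipped = 0
    refs: List[DocRef] = []
    for page in pages:
        # Skip a whole page when it fits entirely inside the "keep" window.
        if skipped + len(page) <= keep:
            skipped += len(page)
            continue
        for hit in page:
            if skipped < keep:
                skipped += 1
            else:
                refs.append((hit["_type"], hit["_id"]))
    return refs


def build_bulk_delete_body(index: str, refs: Iterable[DocRef]) -> str:
    """Build a newline-delimited bulk body with one delete action per document."""
    actions = (
        json.dumps({"delete": {"_index": index, "_type": doc_type, "_id": doc_id}})
        for doc_type, doc_id in refs
    )
    # Every action line, including the last, must end with a newline.
    return "".join(action + "\n" for action in actions)


# Hypothetical scroll output: three pages, already sorted by the query.
pages = [
    [{"_type": "event", "_id": "1"}, {"_type": "event", "_id": "2"}],
    [{"_type": "event", "_id": "3"}, {"_type": "event", "_id": "4"}],
    [{"_type": "event", "_id": "5"}],
]

refs = collect_refs_to_delete(pages, keep=3)
body = build_bulk_delete_body("events", refs)
```

With `keep=3` the first page is skipped whole, the second page is split, and documents 4 and 5 end up in the delete list. The body is only built after every page has been consumed, which matches deleting once the whole scroll is done.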
If I refresh the index at the right points (so that the changes are visible to search), I can verify that the document count is as expected before and after the delete.
This assumes a couple of things, to simplify any consistency issues that might arise:
- No documents that would match the delete query are indexed while the delete is being performed.
- The delete is performed only after the whole cursor has been processed. I was tempted to delete after each page, which would reduce the number of _ids I have to cache, but deleting at the end seemed the safer alternative, at least for this prototype.