I'm trying to delete all documents except for the top N sorted by some criterion and can't quite figure out how. I am able to retrieve those top N documents with a query using sort, from and size. We're using Elasticsearch 1.6.
I've tried:
Delete by query - however I get the error: request does not support [sort]. I couldn't find any documentation saying that the "sort" parameter is not supported in delete by query.
Since delete by query is deprecated in later releases anyway, I tried to use a scroll so I could feed the results into a bulk delete. However, the scroll does not seem to support starting at N.
So I guess the only reasonable solution is to use a cursor and simply discard the first N items.
Any other solutions? Will the Delete By Query plugin in 2.0 support "sort" and "from"?
The "top N documents" are the requirements of the application I am building; I am trying to prototype that in Elasticsearch. That initial question eventually morphed into my trying to figure out how to delete documents properly in general.
It turns out that the solution I was speculating about at the end of my prior email worked like a charm (with a bit of client coding). I'm not sure about the performance, but for low document counts it works just fine.
1. Open a scroll on a query sorted by the required criteria (in my case a date/time field) and set a page size that makes sense. The query only retrieves the basic document properties, such as _type and _id, not the actual documents, which are irrelevant here.
2. Use the scroll to retrieve page after page of results.
3. Keep skipping documents (or whole pages) until the desired number of documents has been skipped.
4. Once past the skipped documents, save the _type and _id of every remaining document in a list.
5. When the scroll is complete, build a bulk request that deletes all the documents saved above by their type and id.
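The steps above can be sketched roughly as follows. This is only an illustration of the skip-then-collect logic, not the client code I actually used: the function names and the index name are hypothetical, and the pages are assumed to be the lists of hits that each scroll request returns, already in sorted order. The output is the newline-delimited body that the bulk API expects for delete actions.

```python
import json
from typing import Dict, Iterable, List, Tuple


def ids_beyond_top_n(pages: Iterable[List[Dict]], keep: int) -> List[Tuple[str, str]]:
    """Walk sorted pages of hits, skip the first `keep` hits, collect the rest.

    Each hit is assumed to be a dict carrying at least "_type" and "_id".
    """
    skipped = 0
    to_delete = []
    for hits in pages:
        for hit in hits:
            if skipped < keep:
                skipped += 1  # still inside the top N, so leave it alone
                continue
            to_delete.append((hit["_type"], hit["_id"]))
    return to_delete


def bulk_delete_body(index: str, to_delete: List[Tuple[str, str]]) -> str:
    """Build a newline-delimited bulk body with one delete action per document."""
    lines = [
        json.dumps({"delete": {"_index": index, "_type": doc_type, "_id": doc_id}})
        for doc_type, doc_id in to_delete
    ]
    # The bulk API requires a trailing newline after the last action.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical data: two scroll pages, keep the top 2 documents.
pages = [
    [{"_type": "event", "_id": "a"}, {"_type": "event", "_id": "b"}],
    [{"_type": "event", "_id": "c"}, {"_type": "event", "_id": "d"}],
]
to_delete = ids_beyond_top_n(pages, keep=2)
body = bulk_delete_body("my_index", to_delete)
```

The body would then be sent to the _bulk endpoint in a single request once the scroll has finished, which matches the "delete after the whole cursor" choice described below.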
If I refresh the index at the right points I can verify that the count of items is as expected before and after the delete.
This assumes a couple of things, to simplify any consistency issues that might arise:
No documents that would match the delete query are indexed at the same time the delete is performed.
The delete is performed only after the whole cursor has been processed. I was tempted to delete after each page, which would reduce the number of document _ids that I have to cache, but deleting at the end seemed the safer alternative, at least for this prototype.
I'd find the top N documents, put them into a new index and then delete the old one.
This is much simpler and you can use aliases to make things transparent for your application.
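As a rough illustration of the alias part, the switch from the old index to the new one can be done in one request to the _aliases endpoint. The sketch below only builds the request body; the alias and index names are hypothetical placeholders, and the helper function is not part of any client library.

```python
import json


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that moves `alias` from old_index to new_index.

    Both actions go in the same request, so the application never sees
    the alias pointing at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
swap = alias_swap_body("items", "items_v1", "items_v2")
payload = json.dumps(swap)
```

The application keeps querying the alias name, and the old index can be deleted once the alias points at the new one.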
That would be possible too, however there are many other documents in that index which do not match that delete query.
I find that a lot of proposed solutions to harder Elasticsearch problems hover around the idea that you should create another index. That is, of course, a legitimate solution, but in many cases it would lead to an explosion of micro-indexes, which would then have to be managed somehow and aggregated for larger queries. At this point I find it easier to limit the number of indexes and manage the data by creating multiple types within an index. However, it looks like even that freedom is going away in 2.0, where fields with the same name in different types must have the same data type.