Hi! I'm new to Elasticsearch and I have a particular use case for which I don't know if I should use a basic search or a scroll search.
I have an index in which I periodically save a copy of JSON documents. Each JSON document has an ID that refers to an external data entity. Documents are inserted every day, but not necessarily for every entity ID.
At a certain date A, I want to retrieve and process the most recent entry for each existing entity ID. In other words, I want to filter documents whose date is before date A, sort them by descending date, and group them by ID, keeping only the first document of each group.
Here are the two solutions:
- Using search with a `collapse` clause on the ID, then iterating over pages until I have fetched all the documents corresponding to each ID.
- Using scroll to fetch all documents whose date is less than date A sorted by descending date, then manually handling the grouping because I can't use `collapse` when using scroll.
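For reference, the first option would look roughly like the request body below. This is only a sketch: the field names `entity_id` and `date` are placeholders for my actual mapping, and `build_collapse_query` is just an illustrative helper, not part of any client library.

```python
from typing import Any, Dict


def build_collapse_query(date_a: str, page_size: int = 100, offset: int = 0) -> Dict[str, Any]:
    """Build the body for one page of 'latest document per entity before date_a'.

    Assumed (hypothetical) field names:
      - "date": the date stored with each document
      - "entity_id": the external entity ID, assumed to be a keyword or numeric field
    """
    return {
        # Keep only documents strictly before date A.
        "query": {"range": {"date": {"lt": date_a}}},
        # Most recent first, so the top hit of each collapsed group is the latest version.
        "sort": [{"date": {"order": "desc"}}],
        # One hit per entity ID.
        "collapse": {"field": "entity_id"},
        # Plain from/size paging over the collapsed results.
        "from": offset,
        "size": page_size,
    }


first_page = build_collapse_query("2024-01-01", page_size=100, offset=0)
second_page = build_collapse_query("2024-01-01", page_size=100, offset=100)
```

Each body would be sent to the index's `_search` endpoint, increasing `from` page by page. As far as I understand, `from` + `size` paging is capped by `index.max_result_window` (10,000 by default), which is part of why I doubt this option for large datasets.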
The first solution does not seem suitable, since I could be dealing with very large datasets, which is what the scroll API seems to be made for.
But scrolling does not really seem convenient given I can have hundreds of versions for each ID, dating years before date A, when I only need the most recent version.
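To illustrate the manual grouping the scroll option would need, here is a minimal sketch. It assumes the hits arrive sorted by descending date and have the usual `{"_source": {...}}` shape; `entity_id` is again a placeholder field name and `latest_per_entity` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def latest_per_entity(hits: Iterable[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Keep the first document seen for each entity ID.

    Assumes `hits` is ordered by descending date, so the first document
    seen for an ID is also its most recent one.
    """
    latest: Dict[Any, Dict[str, Any]] = {}
    for hit in hits:
        source = hit["_source"]
        entity_id = source["entity_id"]  # hypothetical field name
        if entity_id not in latest:
            latest[entity_id] = source
        # Older versions of an already-seen ID are read and then discarded.
    return latest


sample_hits = [
    {"_source": {"entity_id": "a", "date": "2023-12-31", "value": 3}},
    {"_source": {"entity_id": "b", "date": "2023-12-30", "value": 7}},
    {"_source": {"entity_id": "a", "date": "2023-11-01", "value": 2}},
    {"_source": {"entity_id": "a", "date": "2022-05-10", "value": 1}},
]
result = latest_per_entity(sample_hits)
```

The grouping itself is simple, but every older version still has to be transferred and skipped, which is exactly the overhead I'm worried about.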
What would be the best approach for this use case?
Thanks for any hints. Maybe I'm not headed in the right direction here!