I am building a system where the user can search for documents to be processed. The user is not shown the documents themselves, only how many there are. Because processing these documents takes a while, it will be done in a background (delayed) job. However, between the first search and the job that processes the results, documents may have been added or deleted. Since my service charges per document processed, I need to "save" the first search rather than just run another one.
This is where I thought of the Scroll API. I would open a scroll with a keep-alive like 10m to save the search. But if I'm not mistaken, the first page would be lost, since the initial request already returns the first documents, right?
How would you do such a thing? I'm a bit confused. I'd like to get the number of documents I will process and save this search to be processed later.
Why don't you want to start the scroll process immediately in the background?
I mean that once you start scrolling, the data stays consistent regardless of what is indexed or deleted afterwards and however long it takes to scroll through it.
So I'd open the scroll, send the number of hits back to the user, and continue scrolling with the scroll id in an asynchronous thread.
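A minimal sketch of that flow in Python, under a few assumptions: `client` is any object exposing `search(index=..., scroll=..., body=...)` and `scroll(scroll_id=..., scroll=...)` in the style of the official Python client (keyword names can differ between client versions), and the helper names `open_scroll` and `remaining_pages` are made up for illustration. Note that the initial search already returns the first page, so the sketch hands it back alongside the count and the scroll id instead of dropping it.

```python
from typing import Any, Dict, Iterator, List, Tuple


def total_hits(response: Dict[str, Any]) -> int:
    """Read hits.total, which is a plain int in older versions
    and an object with a "value" key in newer ones."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def open_scroll(client: Any, index: str, query: Dict[str, Any],
                keep_alive: str = "10m", page_size: int = 500
                ) -> Tuple[int, str, List[Dict[str, Any]]]:
    """Open the scroll and return (hit count, scroll id, first page of hits).

    The first page comes back with this very request, so the caller has to
    keep it (or process it) together with the scroll id.
    """
    response = client.search(index=index, scroll=keep_alive,
                             body={"query": query, "size": page_size})
    return total_hits(response), response["_scroll_id"], response["hits"]["hits"]


def remaining_pages(client: Any, scroll_id: str,
                    keep_alive: str = "10m") -> Iterator[List[Dict[str, Any]]]:
    """Yield every page after the first one until the scroll is exhausted."""
    while True:
        response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        hits = response["hits"]["hits"]
        if not hits:
            return
        # The scroll id may change between calls; always use the latest one.
        scroll_id = response["_scroll_id"]
        yield hits
```

The count can be sent back to the user right after `open_scroll`, while a background worker processes the first page and then iterates over `remaining_pages`. The keep-alive only needs to cover the time between two consecutive scroll calls, not the whole job.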
I can't process the results immediately because the user might change some criteria of the search, which would invalidate the search and therefore the count and everything else.
The user flow is:
As the user enters criteria, the live number of documents that will be affected is shown inside the "pay" button (e.g. "Pay for 139'482 documents").
When he clicks the button, we need to "snapshot" the results so the job can process them.
Maybe I am overthinking it, but I can't see how processing them immediately would solve my problem in this case.
If you are only indexing new data and not updating existing records, you could add an index-time timestamp and retrieve its max value in your initial query. When you later initiate your scan-scroll query, you could use this value to limit the results to the same set that was available at the time of the original count.
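A rough sketch of the request bodies this would involve, written as plain Python dicts. The field name `indexed_at` and the helper names are hypothetical, and the approach only holds under the condition stated above: documents are added, never updated or deleted.

```python
from typing import Any, Dict


def count_request(user_query: Dict[str, Any],
                  ts_field: str = "indexed_at") -> Dict[str, Any]:
    """Body for the initial count: no hits returned, plus a max
    aggregation on the index-time timestamp (hypothetical field name)."""
    return {
        "size": 0,
        "query": user_query,
        "aggs": {"cutoff": {"max": {"field": ts_field}}},
    }


def cutoff_from(response: Dict[str, Any]) -> Any:
    """Extract the max timestamp from the count response.

    Assumes the field is mapped as a date, in which case the aggregation
    usually carries a formatted "value_as_string"; falls back to "value".
    """
    agg = response["aggregations"]["cutoff"]
    return agg.get("value_as_string", agg["value"])


def frozen_query(user_query: Dict[str, Any], cutoff: Any,
                 ts_field: str = "indexed_at") -> Dict[str, Any]:
    """Wrap the user's query so that it only matches documents indexed
    at or before the cutoff seen during the original count."""
    return {
        "bool": {
            "must": [user_query],
            "filter": [{"range": {ts_field: {"lte": cutoff}}}],
        }
    }
```

The cutoff would be stored with the job when the user pays, and the job would then scroll over `frozen_query(...)` instead of the raw user query.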
Thank you @Christian_Dahlqvist, but unfortunately documents can be updated between two searches, or even between the last search and the moment the job is processed.
English translation of what I wrote in French in the other topic (I did my best to translate ^^"):
So, our user has a GUI where he can set criteria for his search. As he fills in the form, we display the live number of documents that will be processed.
After that, the user can click on the "pay" button. It is at this moment that we have to process the documents from the last search. At least, that's how I've imagined this.
However, my problem is that right now the searches correctly give me the live count as I set the criteria, but if I open a scroll, it already returns some results, and those won't be returned again later in the scroll for the job to process...