Hi,
We have developed a file storage solution that uses Elasticsearch to store metadata about files, accessed through the Java REST client API.
We currently have pagination implemented via “from” and “size”. The client makes a call to us specifying the size and, optionally, a page number; we use the page number to calculate the offset, or “from”.
Clients are also allowed to sort by any field, which can range from strings to dates, integers, etc.
The from and size approach is causing us issues at the moment with deep pagination. For example (solution 1):
1. /rest/metadata/search
   1. numberOfHitsPerPage = 5000
   2. from(0), size(5000)
2. /rest/metadata/search?pageNumber=2
   1. numberOfHitsPerPage = 5000
   2. from(5000), size(5000)
3. /rest/metadata/search?pageNumber=3
   1. from(10000), size(5000)
   2. From + size = 15,000, which is over the index.max_result_window of 10,000 and will fail.
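To make the arithmetic in solution 1 concrete, here is a minimal sketch of the page-number to offset calculation and the window check. The class name PageWindow and its method names are made up for illustration; only the 10,000 limit comes from the setting quoted above.

```java
// Sketch of the page-number -> "from" arithmetic used in solution 1.
// PageWindow and its methods are illustrative names, not part of any client API.
final class PageWindow {
    // The index.max_result_window value quoted above.
    static final int MAX_RESULT_WINDOW = 10_000;

    // Pages are 1-based: page 1 starts at offset 0.
    static int from(int pageNumber, int size) {
        if (pageNumber < 1 || size < 1) {
            throw new IllegalArgumentException("pageNumber and size must be >= 1");
        }
        return Math.multiplyExact(pageNumber - 1, size);
    }

    // True when from + size stays within the window, i.e. the request is accepted.
    static boolean fitsInWindow(int pageNumber, int size) {
        return (long) from(pageNumber, size) + size <= MAX_RESULT_WINDOW;
    }

    public static void main(String[] args) {
        for (int page = 1; page <= 3; page++) {
            System.out.printf("page=%d from=%d fits=%b%n",
                    page, from(page, 5000), fitsInWindow(page, 5000));
        }
    }
}
```

With a page size of 5000, pages 1 and 2 fit (from + size is 5,000 and 10,000) and page 3 does not (15,000).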
I have been looking into the searchAfter functionality and have implemented it: on the response we return the “sort” value of the last hit, which the client can pass in subsequent calls to avoid the above issue. For example (solution 2):
1. /rest/metadata/search
   1. numberOfHitsPerPage = 5000
   2. We return the 5000 hits but also include the sort value of the last hit.
2. /rest/metadata/search?lastIndexValue=1581418484000
   1. numberOfHitsPerPage = 5000
   2. Under the hood we then use search_after to search from 1581418484000, and return the next 5000 hits and the new last index value.
3. /rest/metadata/search?lastIndexValue=1581418484011
   1. numberOfHitsPerPage = 5000
   2. Under the hood we then use search_after to search from 1581418484011, and return the next 5000 hits and the new last index value.
   3. There is no exception here because the search_after value is applied on the search request itself and only 5000 hits are requested at a time.
This works fine in some cases but also gives us odd results because, as mentioned above, we allow sorting by any field. For example, say we have 100 files stored with the “extension” field set to txt and 100 with it set to pdf. A user makes one call with size set to 10 and sorts by “extension”. We return the hits along with the last “sort” value, which is “txt”. That “txt” value is then used as the searchAfter value in the subsequent call, but that call doesn't return any results.
So it looks like searchAfter only works well with fields like dates, etc.
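To illustrate what I think is going on: as far as I understand it, search_after only returns hits that sort strictly after the supplied values, so hits sharing the last sort value get skipped. The sketch below is a plain in-memory emulation of that behaviour, not real Elasticsearch calls. The Doc class, the ids and the helper names are all hypothetical, and the second mode assumes a unique field (here id) is added to the sort as a tiebreaker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// In-memory emulation of search_after paging; no Elasticsearch calls are made.
final class SearchAfterDemo {
    // Hypothetical document with a unique id and a non-unique extension.
    static final class Doc {
        final String id;
        final String extension;
        Doc(String id, String extension) { this.id = id; this.extension = extension; }
    }

    // Sort by extension, then by the unique id as a tiebreaker.
    static final Comparator<Doc> BY_EXT_THEN_ID =
            Comparator.comparing((Doc d) -> d.extension).thenComparing((Doc d) -> d.id);

    // 100 pdf and 100 txt files, as in the example above.
    static List<Doc> corpus() {
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            docs.add(new Doc(String.format("pdf-%03d", i), "pdf"));
            docs.add(new Doc(String.format("txt-%03d", i), "txt"));
        }
        docs.sort(BY_EXT_THEN_ID);
        return docs;
    }

    // Returns up to `size` docs that sort strictly after `after` (null means first page).
    static List<Doc> nextPage(List<Doc> sorted, Doc after, int size, boolean useTiebreaker) {
        List<Doc> page = new ArrayList<>();
        for (Doc d : sorted) {
            if (page.size() == size) break;
            boolean isAfter;
            if (after == null) {
                isAfter = true;
            } else if (useTiebreaker) {
                isAfter = BY_EXT_THEN_ID.compare(d, after) > 0;      // compare (extension, id)
            } else {
                isAfter = d.extension.compareTo(after.extension) > 0; // compare extension only
            }
            if (isAfter) page.add(d);
        }
        return page;
    }

    // Pages through the corpus until an empty page and counts the docs seen.
    static int countReachable(int size, boolean useTiebreaker) {
        List<Doc> sorted = corpus();
        int total = 0;
        Doc last = null;
        while (true) {
            List<Doc> page = nextPage(sorted, last, size, useTiebreaker);
            if (page.isEmpty()) return total;
            total += page.size();
            last = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        System.out.println("extension only:   " + countReachable(10, false));
        System.out.println("extension and id: " + countReachable(10, true));
    }
}
```

In this emulation, paging by extension alone reaches only 20 of the 200 files (10 pdf, 10 txt, then an empty page after “txt”), while paging by extension plus id reaches all 200. Whether a tiebreaker field is practical for us depends on having a unique, sortable field on every document.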
I was thinking we could potentially store the last sort value internally: go back to solution 1, but when from + size > 10,000, use the stored last sort value with search_after, so it stays hidden from the client. The only problem I see is where to store the last sort value. It would need to be unique per search, and I don't want a huge DB filled with these sort values purely for this.
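One way the storage could stay small is a bounded in-memory cache rather than a DB. This is only a rough sketch under my own assumptions: CursorCache, the searchKey string (say, a hash of the query, sort and page size) and the method names are all hypothetical. It uses LinkedHashMap in access order so the least recently used entries are evicted once the cap is reached.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU store for the last sort values of each served page.
// CursorCache and searchKey are hypothetical names for illustration only.
final class CursorCache {
    private final Map<String, Object[]> lastSortValues;

    CursorCache(final int maxEntries) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.lastSortValues = new LinkedHashMap<String, Object[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Object[]> eldest) {
                return size() > maxEntries; // evict once the cap is exceeded
            }
        };
    }

    private static String key(String searchKey, int pageNumber) {
        return searchKey + "#" + pageNumber;
    }

    // Record the sort values of the last hit on a page we just served.
    synchronized void remember(String searchKey, int pageNumber, Object[] sortValues) {
        lastSortValues.put(key(searchKey, pageNumber), sortValues);
    }

    // Sort values to pass to search_after for pageNumber, i.e. those of the
    // previous page; null if that page was never served or has been evicted.
    synchronized Object[] cursorFor(String searchKey, int pageNumber) {
        return lastSortValues.get(key(searchKey, pageNumber - 1));
    }

    synchronized int size() {
        return lastSortValues.size();
    }
}
```

The catch is that this only helps when the client walks the pages in order. If the previous page's entry is missing or evicted, cursorFor returns null and we would still need some fallback.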
Thoughts?
Thanks,