I have come upon an interesting problem with pagination that I was
wondering if anyone else has solved elegantly. The problem is best
described by Twitter's dev docs:
https://dev.twitter.com/rest/public/timelines.
Essentially, using the from and size parameters
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-from-size.html)
makes it very hard to get the correct documents for page two of the results
if one or more documents have been added since page one was loaded and the
index is sorted from newest to oldest. Twitter suggests adding the number
of documents inserted since the previous request to the offset (or from
param); however, with that solution we're reliant on the client having an
accurate count of the documents added since the first page was loaded.
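To make the setup concrete, here's roughly what those requests look like
from the client side (a minimal sketch in Python using the requests
library; the tweets index, the host, and the created_at field are just
placeholders for whatever you're actually querying):

import requests

ES_URL = "http://localhost:9200/tweets/_search"  # placeholder host and index
PAGE_SIZE = 1

def fetch_page(page_number):
    """Fetch one page, sorted newest to oldest, using plain from/size."""
    body = {
        "sort": [{"created_at": {"order": "desc"}}],  # placeholder timestamp field
        "from": page_number * PAGE_SIZE,  # offset into the sorted result set
        "size": PAGE_SIZE,
        "query": {"match_all": {}},
    }
    return requests.post(ES_URL, json=body).json()["hits"]["hits"]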
For example the following index contains documents sorted from newest to
oldest:
E (newest)
D
C
B
A (oldest)
If each page holds a single document, the first page will contain document
E, and the offset (or from parameter) for the next page will be 1, with the
expectation of getting document D on the second page (since there is one
document per page). However, since the first page was loaded, document G
has been added to the index.
Now the index looks like this:
G (newest)
E
D
C
B
A (oldest)
Using an offset (from parameter) of 1 in this case will return document
E... again. That is NOT the intended behavior and leads to duplicate
documents being returned.
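Twitter's suggested workaround, as I understand it, would look something
like this sketch (same placeholders as above; docs_added_since_first_page
is whatever count the client has managed to track, which is exactly the
part I don't trust):

import requests

ES_URL = "http://localhost:9200/tweets/_search"  # placeholder host and index
PAGE_SIZE = 1

def fetch_page_with_adjustment(page_number, docs_added_since_first_page):
    """Twitter-style fix: shift the offset by however many documents the
    client believes were added since the first page was loaded."""
    body = {
        "sort": [{"created_at": {"order": "desc"}}],  # still newest first
        # With one new document (G), from becomes 1 + 1 = 2, which skips G and
        # E and lands on D for page two, but only if the client's count is right.
        "from": page_number * PAGE_SIZE + docs_added_since_first_page,
        "size": PAGE_SIZE,
        "query": {"match_all": {}},
    }
    return requests.post(ES_URL, json=body).json()["hits"]["hits"]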
The only solution I've come up with doesn't seem ideal.
For the first page I'll perform the same request as in the example above,
except that in addition to returning document E, the total number of
documents in the index will be returned. In the case of the index E through
A that would be 5 documents. Accessing any page after the first will
require providing the total number of documents obtained with the first
request; let's call that startSize. We'll also still pass the offset, which
is 1 here. On the second request and every request thereafter we'll invert
the sorting of the documents so they run from oldest to newest.
The inverted index will look like this:
A (oldest)
B
C
D
E
G (newest)
The number of documents per page will be referred to as pageSize (the size
param in ES). The from parameter is calculated using the following
formula:
from = startSize - offset - pageSize
= 5 - 1 - 1
= 3
while
size = pageSize
= 1
Using the inverted sort and the calculated parameters gives us document D,
which is the expected result for page two as it stood before document G was
added to the index. On page 3 we'll get document C, and so on.
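Putting that together, the whole flow would look roughly like this sketch
(same placeholders as before; startSize is read from hits.total on the
first response):

import requests

ES_URL = "http://localhost:9200/tweets/_search"  # placeholder host and index
PAGE_SIZE = 1

def fetch_first_page():
    """Page one: newest first, from=0, and capture the total hit count."""
    body = {
        "sort": [{"created_at": {"order": "desc"}}],
        "from": 0,
        "size": PAGE_SIZE,
        "query": {"match_all": {}},
    }
    result = requests.post(ES_URL, json=body).json()
    start_size = result["hits"]["total"]  # 5 in the E-through-A example
    return result["hits"]["hits"], start_size

def fetch_later_page(page_number, start_size):
    """Pages two and up: invert the sort and compute from against startSize."""
    offset = page_number * PAGE_SIZE  # the offset we would normally have used
    body = {
        "sort": [{"created_at": {"order": "asc"}}],  # oldest first now
        # from = startSize - offset - pageSize, e.g. 5 - 1 - 1 = 3, which is
        # document D in the oldest-to-newest (A through G) ordering.
        "from": start_size - offset - PAGE_SIZE,
        "size": PAGE_SIZE,
        "query": {"match_all": {}},
    }
    return requests.post(ES_URL, json=body).json()["hits"]["hits"]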
That formula gives the expected results when working with indices that are
sorted from newest to oldest, constantly growing, and accessed with
pagination. I don't see this algorithm significantly increasing the cost of
accessing the API, but with that said, I can't help thinking I've let the
early hours of the morning get the best of me.
Is there a better solution or something built into elasticsearch to handle
this use case?
Thanks in advance!