Scroll vs pagination using from/size


(Mohit Anchlia) #1

I am trying to understand the Pros and Cons of using scroll vs from/size
for pagination? Scroll document advises against using it for real time
users but it doesn't say why.

Also, what are the disadvantages of using from/size? I read that pagination
is not efficient because it always pulls the top sorted result in memory.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWr%2BR8zEGd4DuEPfAowS0Y-1ZDz9MxHZFE_9U3UBZeEheg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Nik Everett) #2

On Thu, Apr 10, 2014 at 4:52 PM, Mohit Anchlia mohitanchlia@gmail.comwrote:

I am trying to understand the Pros and Cons of using scroll vs from/size
for pagination? Scroll document advises against using it for real time
users but it doesn't say why.

I believe scrolls pin some state in memory until they expire or you scroll
all the way through them. They are great for maintenance scripts that need
to get all the results but you don't want too many in flight at once. I
think.

Also, what are the disadvantages of using from/size? I read that
pagination is not efficient because it always pulls the top sorted result
in memory.

This one is easy. Elasticsearch/lucene has to keep a min heap of all the
documents you find and the score that is from + size big. Technically it
is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Nik

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd3jWS7mbzjxpu7ED_nRoTQQJmmxYWNGNgmMuJH17-WuAA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #3

On Thu, Apr 10, 2014 at 11:13 PM, Nikolas Everett nik9000@gmail.com wrote:

This one is easy. Elasticsearch/lucene has to keep a min heap of all the
documents you find and the score that is from + size big. Technically it
is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Everything that Nikolas said is correct. I'd like to add that starting with
Elasticsearch 1.2.0, paging with scroll is going to be more efficient[1]
since the worst case will be O(hits * log(size)) instead of O(hits *
log(from + size)). If you are interested in why it is possible, the reason
is that on each shard, scroll is going to keep track of the least document
that is part of the hits of the previous page, so that you can just ignore
documents that compare greater than this document instead of adding them to
the priority queue.

The issue with realtime is that it creates lots of segments that usually
get merged very quickly. On the other hand, scroll works by asking the
shard to keep open the view over the index that was used for the first
page, until the scroll is closed. This can delay space reclamation and
force Elasticsearch to keep a significant number of files open (beware of
going out of file descriptors).

If you have important search traffic, I would recommend not to use scroll
for every user because of its cost. It is usually a better idea to just
increase the from parameter and prevent your users from performing deep
paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if they tell us that your query matched
millions of documents, they only allow you to get hits for a few tens of
pages.)

[1] https://github.com/elasticsearch/elasticsearch/issues/4940

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Mohit Anchlia) #4

Thanks Adrien and Nikolas it's very helpful.

On Thu, Apr 10, 2014 at 3:19 PM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

On Thu, Apr 10, 2014 at 11:13 PM, Nikolas Everett nik9000@gmail.comwrote:

This one is easy. Elasticsearch/lucene has to keep a min heap of all the
documents you find and the score that is from + size big. Technically it
is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Everything that Nikolas said is correct. I'd like to add that starting
with Elasticsearch 1.2.0, paging with scroll is going to be more
efficient[1] since the worst case will be O(hits * log(size)) instead of
O(hits * log(from + size)). If you are interested in why it is possible,
the reason is that on each shard, scroll is going to keep track of the
least document that is part of the hits of the previous page, so that you
can just ignore documents that compare greater than this document instead
of adding them to the priority queue.

The issue with realtime is that it creates lots of segments that usually
get merged very quickly. On the other hand, scroll works by asking the
shard to keep open the view over the index that was used for the first
page, until the scroll is closed. This can delay space reclamation and
force Elasticsearch to keep a significant number of files open (beware of
going out of file descriptors).

If you have important search traffic, I would recommend not to use scroll
for every user because of its cost. It is usually a better idea to just
increase the from parameter and prevent your users from performing deep
paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if they tell us that your query matched
millions of documents, they only allow you to get hits for a few tens of
pages.)

[1] https://github.com/elasticsearch/elasticsearch/issues/4940

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWo6Sm9vxc-wsE6nyE3nE8w9Ke6eE3EQbeXtihYUbOskHg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Mohit Anchlia) #5

I have one more follow up question, how can one know if there are more
documents or not? This is to avoid one exta last call if possible.

On Thu, Apr 10, 2014 at 3:47 PM, Mohit Anchlia mohitanchlia@gmail.comwrote:

Thanks Adrien and Nikolas it's very helpful.

On Thu, Apr 10, 2014 at 3:19 PM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

On Thu, Apr 10, 2014 at 11:13 PM, Nikolas Everett nik9000@gmail.comwrote:

This one is easy. Elasticsearch/lucene has to keep a min heap of all
the documents you find and the score that is from + size big. Technically
it is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Everything that Nikolas said is correct. I'd like to add that starting
with Elasticsearch 1.2.0, paging with scroll is going to be more
efficient[1] since the worst case will be O(hits * log(size)) instead of
O(hits * log(from + size)). If you are interested in why it is possible,
the reason is that on each shard, scroll is going to keep track of the
least document that is part of the hits of the previous page, so that you
can just ignore documents that compare greater than this document instead
of adding them to the priority queue.

The issue with realtime is that it creates lots of segments that usually
get merged very quickly. On the other hand, scroll works by asking the
shard to keep open the view over the index that was used for the first
page, until the scroll is closed. This can delay space reclamation and
force Elasticsearch to keep a significant number of files open (beware of
going out of file descriptors).

If you have important search traffic, I would recommend not to use scroll
for every user because of its cost. It is usually a better idea to just
increase the from parameter and prevent your users from performing deep
paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if they tell us that your query matched
millions of documents, they only allow you to get hits for a few tens of
pages.)

[1] https://github.com/elasticsearch/elasticsearch/issues/4940

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWqnkRqD%2BoAX1W4ThSCE-%3DWtgYPqkvVUgEFXCj8iWJf2JA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #6

Elasticsearch exposes the total number of hits in the search responses,
let's call it T. So if your page size is P, you know that there are ceil(T / P) pages.

On Fri, Apr 11, 2014 at 5:48 AM, Mohit Anchlia mohitanchlia@gmail.comwrote:

I have one more follow up question, how can one know if there are more
documents or not? This is to avoid one exta last call if possible.

On Thu, Apr 10, 2014 at 3:47 PM, Mohit Anchlia mohitanchlia@gmail.comwrote:

Thanks Adrien and Nikolas it's very helpful.

On Thu, Apr 10, 2014 at 3:19 PM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

On Thu, Apr 10, 2014 at 11:13 PM, Nikolas Everett nik9000@gmail.comwrote:

This one is easy. Elasticsearch/lucene has to keep a min heap of all
the documents you find and the score that is from + size big. Technically
it is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Everything that Nikolas said is correct. I'd like to add that starting
with Elasticsearch 1.2.0, paging with scroll is going to be more
efficient[1] since the worst case will be O(hits * log(size)) instead of
O(hits * log(from + size)). If you are interested in why it is possible,
the reason is that on each shard, scroll is going to keep track of the
least document that is part of the hits of the previous page, so that you
can just ignore documents that compare greater than this document instead
of adding them to the priority queue.

The issue with realtime is that it creates lots of segments that usually
get merged very quickly. On the other hand, scroll works by asking the
shard to keep open the view over the index that was used for the first
page, until the scroll is closed. This can delay space reclamation and
force Elasticsearch to keep a significant number of files open (beware of
going out of file descriptors).

If you have important search traffic, I would recommend not to use
scroll for every user because of its cost. It is usually a better idea to
just increase the from parameter and prevent your users from performing
deep paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if they tell us that your query matched
millions of documents, they only allow you to get hits for a few tens of
pages.)

[1] https://github.com/elasticsearch/elasticsearch/issues/4940

--
Adrien Grand

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAOT3TWqnkRqD%2BoAX1W4ThSCE-%3DWtgYPqkvVUgEFXCj8iWJf2JA%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAOT3TWqnkRqD%2BoAX1W4ThSCE-%3DWtgYPqkvVUgEFXCj8iWJf2JA%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5ghh5iNVOqDoDtzR7E4922JqQOBis9FYG7QZsGo5%2B8Yw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Mohit Anchlia) #7

thanks!

On Fri, Apr 11, 2014 at 12:29 AM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

Elasticsearch exposes the total number of hits in the search responses,
let's call it T. So if your page size is P, you know that there are ceil(T / P) pages.

On Fri, Apr 11, 2014 at 5:48 AM, Mohit Anchlia mohitanchlia@gmail.comwrote:

I have one more follow up question, how can one know if there are more
documents or not? This is to avoid one exta last call if possible.

On Thu, Apr 10, 2014 at 3:47 PM, Mohit Anchlia mohitanchlia@gmail.comwrote:

Thanks Adrien and Nikolas it's very helpful.

On Thu, Apr 10, 2014 at 3:19 PM, Adrien Grand <
adrien.grand@elasticsearch.com> wrote:

On Thu, Apr 10, 2014 at 11:13 PM, Nikolas Everett nik9000@gmail.comwrote:

This one is easy. Elasticsearch/lucene has to keep a min heap of all
the documents you find and the score that is from + size big. Technically
it is min(from + size, max(rescore_window_size)). Anyway, that means some
part of the query has O(n) space and O(n * log(n)) time complexity where n
is from + size. That part might be dwarfed by some other action but it is
there. And technically in the worst case the time complexity is more like
O(hits * log(n)) but thats not likely.

Everything that Nikolas said is correct. I'd like to add that starting
with Elasticsearch 1.2.0, paging with scroll is going to be more
efficient[1] since the worst case will be O(hits * log(size)) instead of
O(hits * log(from + size)). If you are interested in why it is possible,
the reason is that on each shard, scroll is going to keep track of the
least document that is part of the hits of the previous page, so that you
can just ignore documents that compare greater than this document instead
of adding them to the priority queue.

The issue with realtime is that it creates lots of segments that
usually get merged very quickly. On the other hand, scroll works by asking
the shard to keep open the view over the index that was used for the first
page, until the scroll is closed. This can delay space reclamation and
force Elasticsearch to keep a significant number of files open (beware of
going out of file descriptors).

If you have important search traffic, I would recommend not to use
scroll for every user because of its cost. It is usually a better idea to
just increase the from parameter and prevent your users from performing
deep paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if they tell us that your query matched
millions of documents, they only allow you to get hits for a few tens of
pages.)

[1] https://github.com/elasticsearch/elasticsearch/issues/4940

--
Adrien Grand

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAL6Z4j6JwVMTfHr%2BdFbqRvBWJ2%2B2zAAR6g8T9C31-gXpYN4LWQ%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAOT3TWqnkRqD%2BoAX1W4ThSCE-%3DWtgYPqkvVUgEFXCj8iWJf2JA%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAOT3TWqnkRqD%2BoAX1W4ThSCE-%3DWtgYPqkvVUgEFXCj8iWJf2JA%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5ghh5iNVOqDoDtzR7E4922JqQOBis9FYG7QZsGo5%2B8Yw%40mail.gmail.comhttps://groups.google.com/d/msgid/elasticsearch/CAL6Z4j5ghh5iNVOqDoDtzR7E4922JqQOBis9FYG7QZsGo5%2B8Yw%40mail.gmail.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWpRVuOwR%3DtPAJuOtfhcwAQs_0LcRp4iBq1zJHhGUQ_82Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(system) #8