I am trying to understand the pros and cons of using scroll vs. from/size
for pagination. The scroll documentation advises against using it for
real-time user requests, but it doesn't say why.
Also, what are the disadvantages of using from/size? I read that this kind
of pagination is not efficient because it always pulls the top sorted
results into memory.
I believe scrolls pin some state in memory until they expire or until you
scroll all the way through them. They are great for maintenance scripts
that need to retrieve all the results, but you don't want too many of them
in flight at once. I think.
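For what it's worth, here is a minimal sketch of such a maintenance script
in Python, assuming the requests library, a node at http://localhost:9200
and a made-up index name; the scroll-continuation syntax has changed across
Elasticsearch versions, so this uses the JSON-body form of recent releases:

    import requests

    ES = "http://localhost:9200"   # assumed local node
    INDEX = "my_index"             # hypothetical index name

    # Open the scroll: first page of results plus a short keep-alive window.
    resp = requests.post(
        f"{ES}/{INDEX}/_search",
        params={"scroll": "1m"},
        json={"size": 500, "query": {"match_all": {}}},
    ).json()
    scroll_id = resp["_scroll_id"]

    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                print(hit["_id"])          # do whatever maintenance work is needed
            # Ask for the next page; each call renews the 1m keep-alive.
            resp = requests.post(
                f"{ES}/_search/scroll",
                json={"scroll": "1m", "scroll_id": scroll_id},
            ).json()
            scroll_id = resp["_scroll_id"]
    finally:
        # Free the server-side state instead of waiting for it to expire.
        requests.delete(f"{ES}/_search/scroll", json={"scroll_id": [scroll_id]})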
As for the disadvantages of from/size: this one is easy. Elasticsearch/Lucene
has to keep a min-heap, sized from + size, holding the documents it finds and
their scores. Technically it is min(from + size, max(rescore_window_size)).
Anyway, that means some part of the query has O(n) space and O(n * log(n))
time complexity, where n is from + size. That part might be dwarfed by some
other action, but it is there. And technically, in the worst case, the time
complexity is more like O(hits * log(n)), but that's not likely.
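To make that concrete, here is a toy version of that top-(from + size)
collection in Python using the standard heapq module; the doc ids and scores
are made up, and Lucene's real collector is of course more involved:

    import heapq
    import random

    def collect_top_hits(scored_docs, frm, size):
        """Keep only the best (frm + size) hits in a min-heap, then slice out one page."""
        n = frm + size
        heap = []                                   # min-heap of (score, doc_id), at most n entries
        for doc_id, score in scored_docs:           # one pass over every matching hit
            if len(heap) < n:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                # Better than the worst of the current top n: replace it, O(log n).
                heapq.heapreplace(heap, (score, doc_id))
        ranked = sorted(heap, key=lambda t: -t[0])  # best first
        return ranked[frm:frm + size]               # the page that was actually asked for

    # Hypothetical usage: 100,000 hits, from=9990, size=10 -> a heap of 10,000 entries.
    hits = ((i, random.random()) for i in range(100_000))
    print(collect_top_hits(hits, 9990, 10))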
Everything that Nikolas said is correct. I'd like to add that starting with
Elasticsearch 1.2.0, paging with scroll is going to be more efficient[1],
since the worst case will be O(hits * log(size)) instead of
O(hits * log(from + size)). If you are interested in why this is possible:
on each shard, scroll keeps track of the least document that is part of the
hits of the previous page, so that documents that compare greater than this
document can simply be ignored instead of being added to the priority queue.
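Roughly, the idea looks like this toy Python continuation of the collector
above: remember the ranking key of the last hit of the previous page and skip
anything that ranks at or above it, so the priority queue only ever needs
size entries (names and data are made up):

    import heapq

    def scroll_page(scored_docs, size, prev_bottom_key=None):
        """Return one page of hits plus the ranking key to resume from.

        The ranking key is (-score, doc_id), so "smaller" means "ranks higher".
        Anything at or above the previous page's bottom key was already returned,
        so it is skipped without ever entering the priority queue.
        """
        def key(hit):
            doc_id, score = hit
            return (-score, doc_id)

        candidates = (h for h in scored_docs
                      if prev_bottom_key is None or key(h) > prev_bottom_key)
        page = heapq.nsmallest(size, candidates, key=key)   # queue of at most `size` entries
        bottom = key(page[-1]) if page else prev_bottom_key
        return page, bottom

    # Hypothetical usage: walk a tiny corpus two hits at a time.
    docs = [(1, 0.9), (2, 0.8), (3, 0.95), (4, 0.5), (5, 0.7)]
    page1, bottom = scroll_page(docs, 2)
    page2, bottom = scroll_page(docs, 2, bottom)
    print(page1)   # [(3, 0.95), (1, 0.9)]
    print(page2)   # [(2, 0.8), (5, 0.7)]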
The issue with real-time usage is that real-time indexing creates lots of
segments that usually get merged away very quickly. Scroll, on the other
hand, works by asking the shard to keep the view of the index that was used
for the first page open until the scroll is closed. This can delay space
reclamation and force Elasticsearch to keep a significant number of files
open (beware of running out of file descriptors).
If you have significant search traffic, I would recommend not using scroll
for every user because of its cost. It is usually a better idea to just
increase the from parameter, and to prevent your users from performing deep
paging since it might kill your cluster. (If you go to any web search
engine, you'll see that even if it tells you that your query matched
millions of documents, it only lets you get hits for a few tens of pages.)
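A rough sketch of that approach, again assuming the requests library, a
local node and made-up index/field names; the page cap is enforced in the
application, it is not an Elasticsearch setting:

    import requests

    ES = "http://localhost:9200"   # assumed local node
    INDEX = "my_index"             # hypothetical index name
    PAGE_SIZE = 10
    MAX_PAGE = 100                 # application-level cap on deep paging

    def fetch_page(text, page):
        if not 1 <= page <= MAX_PAGE:
            raise ValueError(f"only pages 1..{MAX_PAGE} are served")
        body = {
            "from": (page - 1) * PAGE_SIZE,   # grows with the page number
            "size": PAGE_SIZE,
            "query": {"match": {"title": text}},
        }
        return requests.post(f"{ES}/{INDEX}/_search", json=body).json()

    print(fetch_page("elasticsearch", 3)["hits"]["hits"])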
Elasticsearch exposes the total number of hits in the search responses;
let's call it T. So if your page size is P, you know that there are
ceil(T / P) pages.
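For example, in Python:

    import math

    total_hits = 1_234    # T, read from hits.total in the search response
    page_size = 10        # P
    print(math.ceil(total_hits / page_size))   # 124 pages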