I was wondering, from a performance perspective (more specifically,
cranking through the data as quickly as possible), which one is better if I
wanted to scroll through a large-ish (hundreds of gigabytes to a few
terabytes) index with an ordered field (e.g. all docs have a date field):
1. Do many small scrolls, one for each non-overlapping interval
(again, for example, if I know beforehand that there's only 1 year of data,
then 12 scrolls, one for each month)
2. Go ahead and just do a normal scroll with match_all or something similar.
The reason I ask is that it was mentioned in previous posts that the
deeper you go into a scroll, the slower it gets. Would this technique
alleviate that? What are the tradeoffs? Also, would the answer change if it
was a single-node cluster versus a multi-node cluster?
First, the performance problem with "deeper" scrolling has been fixed (or greatly improved). In general, the benefit of what you suggest comes from the fact that you do things in parallel, so if you handle it on the client side in parallel as well (multiple processes / threads / machines), then do it.
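To make the "do it in parallel on the client side" suggestion concrete, here is a minimal sketch of the month-per-scroll idea from the question. It only shows the partitioning and the fan-out to worker threads. `scroll_interval` is a hypothetical placeholder, not a real client call: an actual version would run one scrolled search restricted to that date range and page through it until no hits remain. The start date, the 12 month count, and the worker count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def month_intervals(start, months):
    """Return `months` consecutive half-open [lo, hi) one-month intervals.

    Assumes `start` is the first day of a month.
    """
    intervals = []
    lo = start
    for _ in range(months):
        # First day of the following month, rolling the year over in December.
        hi = date(lo.year + lo.month // 12, lo.month % 12 + 1, 1)
        intervals.append((lo, hi))
        lo = hi
    return intervals


def scroll_interval(interval):
    """Hypothetical placeholder for one scroll over lo <= date < hi.

    A real implementation would query the cluster here; this stub just
    returns the interval bounds so the sketch runs without a cluster.
    """
    lo, hi = interval
    return (lo.isoformat(), hi.isoformat())


def scroll_in_parallel(intervals, workers=4):
    # One task per interval; map() yields results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scroll_interval, intervals))


results = scroll_in_parallel(month_intervals(date(2011, 1, 1), 12))
```

Half-open intervals keep the ranges non-overlapping, so a document dated exactly on a month boundary is read by only one scroll. Threads are used here because the work is I/O bound; separate processes or machines would follow the same partitioning.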
On Thursday, February 16, 2012 at 11:08 AM, Matt wrote:
I was wondering, from a performance perspective (more specifically, cranking through the data as quickly as possible), which one is better if I wanted to scroll through a large-ish (hundreds of gigabytes to a few terabytes) index with an ordered field (e.g. all docs have a date field):
1. Do many small scrolls, one for each non-overlapping interval (again, for example, if I know beforehand that there's only 1 year of data, then 12 scrolls, one for each month)
2. Go ahead and just do a normal scroll with match_all or something similar.
The reason I ask is that it was mentioned in previous posts that the deeper you go into a scroll, the slower it gets. Would this technique alleviate that? What are the tradeoffs? Also, would the answer change if it was a single-node cluster versus a multi-node cluster?
On Thu, 2012-02-16 at 21:51 +0200, Shay Banon wrote:
First, the performance problem with "deeper" scrolling has been fixed
(or greatly improved). In general, the benefit of what you suggest
comes from the fact that you do things in parallel, so if you handle
it on the client side in parallel as well (multiple processes /
threads / machines), then do it.
Does this performance improvement also apply to non-scan requests?
On Thursday, February 16, 2012 at 10:01 PM, Clinton Gormley wrote:
Does this performance improvement also apply to non-scan requests?