Scan/scrolling: many small scrolls, or one large scroll?


(Matt) #1

I was wondering, from a performance perspective (more specifically,
cranking through the data as quickly as possible), which one is better if I
wanted to scroll through a large-ish (hundreds of gigabytes to a few
terabytes) index with an ordered field (e.g. all docs have a date field):

  1. Do many small scrolls, one for each non-overlapping interval
    (again, for example, if I know beforehand that there's only 1 year of data,
    then 12 scrolls, one for each month)
  2. Go ahead and just do a normal scroll with match_all or something similar.
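
To make option 1 concrete, here is a rough sketch of how the monthly split might look. It only builds the interval boundaries and one range query body per month; it does not talk to a cluster. The field name `date`, the year 2011, and the helper names `month_intervals` and `range_query` are assumptions made for illustration, and the half-open `gte`/`lt` bounds are one way to keep the intervals from overlapping.

```python
from datetime import date


def month_intervals(year):
    """Return 12 half-open (start, end) date pairs covering one calendar year."""
    starts = [date(year, m, 1) for m in range(1, 13)] + [date(year + 1, 1, 1)]
    return list(zip(starts[:-1], starts[1:]))


def range_query(field, start, end):
    """Build a range query body matching start <= field < end.

    The field name is an assumption; use whatever ordered field the docs have.
    """
    return {
        "query": {
            "range": {
                field: {"gte": start.isoformat(), "lt": end.isoformat()}
            }
        }
    }


# One query body per month, each of which would back its own scroll.
intervals = month_intervals(2011)
bodies = [range_query("date", start, end) for start, end in intervals]
```

Because each interval ends exactly where the next one starts, every document should fall into exactly one of the twelve scrolls.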

The reason I ask is because it was mentioned in previous posts that the
deeper you go into a scroll, the slower it gets. Would this technique
alleviate that? What are the tradeoffs? Also, would the answer change if it
were a single node cluster versus a multi-node cluster?

Thanks in advance!

Matt


(Shay Banon) #2

First, the performance problem with "deeper" scrolling has been fixed (or greatly improved). In general, the benefit of what you suggest comes from the fact that you do things in parallel, so if you handle it on the client side in parallel as well (multiple processes / threads / machines), then do it.
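
A minimal sketch of the client-side parallelism described above, assuming one worker thread per interval. `scroll_interval` is a hypothetical stand-in, not a real client call: a real version would open a scan/scroll restricted to the interval and keep fetching pages until none are left. Here it just yields fake document ids so the structure can be run on its own. The integer intervals and the worker count of 4 are also placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def scroll_interval(interval):
    """Hypothetical stand-in for one scan/scroll over a single interval.

    Yields fake doc ids; a real version would page through a scroll instead.
    """
    start, end = interval
    for doc_id in range(start, end):
        yield doc_id


def drain(interval):
    """Consume one scroll to completion and return how many docs it saw."""
    return sum(1 for _ in scroll_interval(interval))


def scan_in_parallel(intervals, workers=4):
    """Run one scroll per interval concurrently; return per-interval counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input intervals.
        return list(pool.map(drain, intervals))


# Non-overlapping placeholder intervals standing in for e.g. monthly ranges.
intervals = [(0, 100), (100, 250), (250, 300)]
counts = scan_in_parallel(intervals)
```

Threads are used here only because the work is mostly waiting on the network; separate processes or machines, as mentioned above, would follow the same one-scroll-per-interval shape.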



(Clinton Gormley) #3

On Thu, 2012-02-16 at 21:51 +0200, Shay Banon wrote:

First, the performance problem with "deeper" scrolling has been fixed
(or greatly improved). In general, the benefit of what you suggest
comes from the fact that you do things in parallel, so if you handle
it on the client side in parallel as well (multiple processes /
threads / machines), then do it.

Does this performance improvement also apply to non-scan requests?

clint


(Shay Banon) #4

No, just for the scan type.



(Matt) #5

Got it, thanks for the response. I see the enhancement lined up for the
proper 0.19 release, will give it a try when it's out. Cheers!

