Scan/scrolling: many small scrolls, or one large scroll?

Matt1 · February 16, 2012, 9:08am

I was wondering, from a performance perspective (more specifically,
cranking through the data as quickly as possible), which one is better if I
wanted to scroll through a large-ish (hundreds of gigabytes to a few
terabytes) index with an ordered field (e.g. all docs have a date field):

1. Do many small scrolls, one for each non-overlapping interval
(again, for example, if I know beforehand that there's only 1 year of data,
then 12 scrolls, one for each month).
2. Go ahead and just do a normal scroll with match_all or something similar.

The reason I ask is because it was mentioned in previous posts that the
deeper you go into a scroll, the slower it gets. Would this technique
alleviate that? What are the tradeoffs? Also, would the answer change if it
was a single-node cluster versus a multi-node cluster?

Thanks in advance!

Matt

kimchy · February 16, 2012, 7:51pm

First, the performance problem with "deeper" scrolling has been fixed (or greatly improved). In general, the benefit of what you suggest comes from the fact that you do things in parallel, so if you handle it on the client side in parallel as well (multiple processes / threads / machines), then do it.
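As a rough illustration of the "do it in parallel on the client side" idea, here is a minimal Python sketch. It splits one year into 12 non-overlapping monthly intervals and runs one scroll per interval on a thread pool. Everything named here is an assumption for illustration: `month_intervals`, `range_query`, and `scroll_all` are hypothetical helpers, `scroll_one` is a callable you would supply to run the actual scan/scroll loop for a given query body, and the `gte`/`lt` range query shape should be checked against your Elasticsearch version.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def month_intervals(year):
    """Return 12 half-open [start, end) date intervals covering one year."""
    starts = [date(year, m, 1) for m in range(1, 13)] + [date(year + 1, 1, 1)]
    return list(zip(starts[:-1], starts[1:]))


def range_query(field, start, end):
    """Build a range query body for one interval.

    Lower bound inclusive, upper bound exclusive, so intervals never overlap.
    The body shape is an assumption; adjust it to your Elasticsearch version.
    """
    return {"query": {"range": {field: {"gte": start.isoformat(),
                                        "lt": end.isoformat()}}}}


def scroll_all(intervals, scroll_one, field="date", workers=4):
    """Run one scroll per interval in parallel.

    scroll_one(body) is a caller-supplied (hypothetical) function that
    performs the scan/scroll for that query and returns an iterable of docs.
    """
    bodies = [range_query(field, s, e) for s, e in intervals]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input intervals
        return [list(docs) for docs in pool.map(scroll_one, bodies)]
```

Threads are used here only because the work is I/O bound; the same split works across processes or machines by handing each worker its own subset of intervals.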

Clinton_Gormley · February 16, 2012, 8:01pm

Does this performance improvement also apply to non-scan requests?

clint

kimchy · February 16, 2012, 8:21pm

No, just for the scan type.

Matt1 · February 17, 2012, 3:53am

Got it, thanks for the response. I see the enhancement lined up for the
proper 0.19 release, will give it a try when it's out. Cheers!
