Scan over 1 million records gets slower and slower


(Chris-3) #1

Hi list,

I'm just wondering whether the scan search type is supposed to get slower
when reading something like 1000 batches of 5000 records, as this is what
I'm seeing. The time needed to get the next result set roughly doubles
after every second result set (and of course reaches the timeout before I
get all documents).

I'm running ES 0.18.4 on a single node, 2 shards, no replicas, with
around 1000 types inside a single index (each with 40 to 60 fields),
10 million rows totalling 40 GB, while scanning by type.

I suspect it's my own fault, but a short yes/no (or some pointer)
would help, thanks.

Greets, Chris


(Shay Banon) #2

It's not your fault. It will take longer as you scan further into the
result set.


(Clinton Gormley) #3

On Mon, 2011-11-21 at 20:42 +0200, Shay Banon wrote:

Its not your fault, it will take longer as you scan further into the
resultset.

For a scrolled 'scan' search? I thought the point of a scan (ie not
being sorted) was that it was an efficient way to retrieve all docs?

clint


(Shay Banon) #4

It is efficient, certainly compared to a sorted search, but there is
still an overhead as you scroll "deeper".


(Karussell) #5

On 22 Nov., 13:12, Shay Banon kim...@gmail.com wrote:

It is efficient, certainly compared to when you do sorting, but, there is
still an overhead as you scroll "deeper".

Yes, although I didn't have that feeling in my case with some million
documents.

Peter.

BTW: there is a new search option available in the upcoming Lucene release.


(Shay Banon) #6

On Wed, Nov 23, 2011 at 10:04 AM, Karussell tableyourtime@googlemail.com wrote:

On 22 Nov., 13:12, Shay Banon kim...@gmail.com wrote:

It is efficient, certainly compared to when you do sorting, but, there is
still an overhead as you scroll "deeper".

Yes, although I didn't have that feeling in my case with some million
documents.

It's not that bad; it simply does an early exit during the collection part
once enough docs have been "collected". A regular search always goes
through all of them (to sort things properly).
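
[Editor's sketch] The early-exit behaviour described above can be illustrated with a toy model. This is not Elasticsearch's collector code. It assumes, for illustration only, that each scroll request walks the matching docs from the start in index order, skips the ones already returned, and stops as soon as the page is full. The names scan_page, sorted_page and scroll_costs are hypothetical.

```python
from typing import List, Tuple


def scan_page(doc_ids: List[int], start: int, size: int) -> Tuple[List[int], int]:
    """Return one unsorted page and the number of docs visited to build it.

    Assumption: the collector walks matches in index order, skips the
    first `start` of them, keeps the next `size`, then exits early.
    """
    page: List[int] = []
    visited = 0
    for doc_id in doc_ids:
        visited += 1
        if visited <= start:
            continue  # already returned by an earlier page
        page.append(doc_id)
        if len(page) == size:
            break  # early exit: enough docs have been collected
    return page, visited


def sorted_page(doc_ids: List[int], start: int, size: int) -> Tuple[List[int], int]:
    """A sorted search has to look at every match before it can rank them."""
    ranked = sorted(doc_ids, reverse=True)
    return ranked[start:start + size], len(doc_ids)


def scroll_costs(total: int, size: int) -> List[int]:
    """Docs visited per page when scrolling through `total` matches."""
    docs = list(range(total))
    return [scan_page(docs, start, size)[1] for start in range(0, total, size)]


if __name__ == "__main__":
    print(scroll_costs(20, 5))                    # per-page cost of the scan model
    print(sorted_page(list(range(20)), 0, 5)[1])  # per-page cost of a sorted search
```

In this model each page still costs less than a sorted search until the very end, but the per-page cost grows with depth, so the total work over a full scroll grows roughly quadratically with the number of pages.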

Peter.

BTW: there is a new search option available in the upcoming lucene.

You mean the searchAfter one? It does something similar.


(Robert Muir) #7

This is not correct. searchAfter is not an optimization, and it doesn't
early exit. It uses a fixed-size priority queue, and because of this the
100 millionth page takes the same time as the first.

But you must pass the last result (the bottom result from the previous page)
so that it knows which entries are 'too competitive' to enter the priority
queue.
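
[Editor's sketch] The paging scheme described above can be sketched in a few lines of Python. This is not Lucene's implementation, only an illustration of the idea under stated assumptions: hits are (score, doc_id) pairs, higher scores rank first, ties are broken by lower doc_id, and the function name search_after is hypothetical. The queue never holds more than n entries, however deep the page is.

```python
import heapq
from typing import List, Optional, Tuple

Hit = Tuple[float, int]  # (score, doc_id); assumed shape for this sketch


def rank(hit: Hit) -> Tuple[float, int]:
    """Larger rank means a better hit: higher score, then lower doc_id."""
    score, doc_id = hit
    return (score, -doc_id)


def search_after(hits: List[Hit], after: Optional[Hit], n: int) -> List[Hit]:
    """Return the n best hits that rank strictly below `after`.

    `after` is the bottom hit of the previous page, or None for page one.
    """
    after_rank = rank(after) if after is not None else None
    heap: List[Tuple[float, int]] = []  # min-heap: worst kept hit at heap[0]
    for hit in hits:
        r = rank(hit)
        if after_rank is not None and r >= after_rank:
            continue  # 'too competitive': it was on an earlier page
        if len(heap) < n:
            heapq.heappush(heap, r)
        elif r > heap[0]:
            heapq.heapreplace(heap, r)  # evict the worst, size stays at n
    # Best first, converted back to (score, doc_id).
    return [(score, -neg_id) for score, neg_id in sorted(heap, reverse=True)]


if __name__ == "__main__":
    hits = [(0.5, 0), (0.9, 1), (0.9, 2), (0.1, 3), (0.7, 4), (0.3, 5)]
    page = search_after(hits, None, 2)
    while page:
        print(page)
        page = search_after(hits, page[-1], 2)  # pass the bottom result on
```

Every page still visits all matching hits, which is why this is not an early exit; what stays constant is the size of the queue and therefore the per-page cost.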

On Nov 23, 2011 5:13 AM, "Shay Banon" kimchy@gmail.com wrote:

You mean the searchAfter one? It does something similar.


(Shay Banon) #8

I meant in terms of cost.

