Scan search type returning fewer than expected records

Hi

We are trying to read documents from our index using the scan search type:

/_search?search_type=scan&scroll=10m&size=100 -d '{"query": {"match_all": {}}}'

We have 20 shards and ~8M documents.

We then fetch each batch with:

/_search/scroll?scroll=10m -d ''
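
For reference, our loop looks roughly like the following. This is a minimal
Python sketch using the requests library; the host, the index name, and
passing the scroll_id as the body of each scroll request are assumptions for
illustration, not our exact client code.

import requests

BASE = "http://localhost:9200"  # placeholder host
INDEX = "records"               # placeholder index name

# Start the scan: the first response carries a scroll_id but no hits yet.
resp = requests.get(
    BASE + "/" + INDEX + "/_search",
    params={"search_type": "scan", "scroll": "10m", "size": 100},
    data='{"query": {"match_all": {}}}',
).json()
scroll_id = resp["_scroll_id"]

returned = 0
total = 0
while True:
    resp = requests.get(
        BASE + "/_search/scroll",
        params={"scroll": "10m"},
        data=scroll_id,               # scroll_id assumed passed as the body
    ).json()
    hits = resp["hits"]["hits"]
    total = resp["hits"]["total"]     # keeps reporting ~8M on every iteration
    if not hits:
        break                         # empty batch means the scroll is done
    returned += len(hits)
    scroll_id = resp["_scroll_id"]    # always carry forward the latest id
    # ... process each record and push it back into the index ...

print("returned %d of %d documents" % (returned, total))

It is this "returned" counter that ends up at ~3.7M while "total" stays at ~8M.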

We noticed that the number of documents returned per iteration starts at 2000
(size=100 on each of the 20 shards, since size applies per shard for the scan
search type) and slowly decreases in steps of 100:

2000
...
1900
...
1800
...

I understand that the decrease is expected as shards start to run out of
records to return. But I would expect something more like 1938 ... 1857 ...
1732 and so on. It's very unlikely that for (almost) all shards the remaining
document count is an exact multiple of 100 (count % 100 == 0).

Though towards the end we do see

767, 562, 490, 297

Before these, all the counts are multiples of 100.

Anyhow, the main problem is that the total number of records returned before
we get an empty batch is ~3.7M, even though every iteration reports ~8M as
the total number of records. It's puzzling why it fails to return all 8M
records and gives back less than half of them, while it clearly still knows
about all 8M, right up to the last iteration. I'd have assumed there was a
bug in our code, but it works fine for a few thousand records.

Not sure if it matters, but the whole process takes more than a day on our
EC2 cluster (we do a bunch of processing on each record before pushing it
back into the index). And we are using 0.18.5. Yes, we are planning to
upgrade to 0.19.x, but we need to get this working right now.

Please help.

Gaurav

That's strange... Which version are you using? Can you simulate it with a
smaller dataset and maybe post the recreation?

On Fri, Jun 29, 2012 at 11:49 PM, Gaurav Vijayvargiya <
gvijayvargiya@gmail.com> wrote:

We are using 0.18.5.

I tried with ~6k records and it worked OK. I'll try with a bigger dataset and
let you know.
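
For the bigger test, the recreation will look something like this sketch:
bulk-index N trivial documents into a throwaway index, refresh, then run the
same scan loop as above and check that it returns N. The index name, document
shape, and batch size here are made up for the test.

import json
import requests

BASE = "http://localhost:9200"  # placeholder host
INDEX = "scan_repro"            # throwaway test index
N = 1000000                     # grow this until the problem reproduces

# Bulk-index N trivial documents in chunks of 5000.
for start in range(0, N, 5000):
    lines = []
    for i in range(start, min(start + 5000, N)):
        lines.append(json.dumps({"index": {"_index": INDEX, "_type": "doc", "_id": str(i)}}))
        lines.append(json.dumps({"num": i}))
    requests.post(BASE + "/_bulk", data="\n".join(lines) + "\n")

# Make the documents visible to search before scanning.
requests.post(BASE + "/" + INDEX + "/_refresh")

# ... then run the scan/scroll loop and verify returned == N ...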

On Friday, June 29, 2012 4:16:14 PM UTC-7, kimchy wrote:

Can you first upgrade to 0.19.7 and check?
