We noticed that the number of documents returned starts at 2000 and
slowly decreases in batches of 100:
2000
....
...
...
1900
...
...
1800
...
...
I understand that the decrease is expected as shards start to run out of
records to return. But I would expect something more like 1938 ...
1857 ... 1732, and so on. It's very unlikely that for (almost) all shards
the number of documents is divisible by 100 (count % 100 == 0).
Though towards the end we do see
767, 562, 490, 297.
Before these, all the numbers are multiples of 100.
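The multiples-of-100 pattern would make sense if each scroll round asks every shard for up to 100 documents (i.e. `size` is applied per shard, and we have 20 shards). A toy simulation of that behavior (not our actual code; the shard document counts are made up) reproduces the same shape:

```python
def scroll_rounds(shard_counts, per_shard):
    """Simulate a scan/scroll pass: each round, every non-empty shard
    returns up to `per_shard` documents; collect the per-round totals."""
    remaining = list(shard_counts)
    rounds = []
    while any(remaining):
        total = 0
        for i, left in enumerate(remaining):
            take = min(left, per_shard)
            remaining[i] -= take
            total += take
        rounds.append(total)
    return rounds

# 20 equal shards: every round returns exactly 20 * 100 = 2000 documents.
print(scroll_rounds([1000] * 20, per_shard=100)[:3])  # [2000, 2000, 2000]

# Unequal shards: totals stay multiples of 100 while full batches come
# back, and only turn into odd values once shards reach their final,
# partial batch -- just like the 767, 562, 490, 297 tail we observed.
print(scroll_rounds([950, 1020, 880], per_shard=100))
```

Under this model, the total per round only stops being a multiple of 100 when some shard is on its last, partial batch, which is exactly when the odd numbers show up at the end.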
Anyhow, the main problem is that the total number of records returned
before we get 0 records back is ~3.7M, even though each iteration
claims that the total number of records is ~8M.
It's puzzling why it fails to return all 8M records and returns less
than half of them, while it clearly knows about all 8M records, even on
the last iteration.
I'd have assumed there was a bug in our code, but it works fine for a few
thousand records.
Not sure if it matters, but the whole process takes more than a day on our
cluster in EC2 (we do a bunch of processing on each record before pushing it
back into the index). And we are using 0.18.5. Yes, we are planning to
upgrade to 0.19.x, but we need to get it working right now.