Bug when scrolling?

Hi,

I simply want a way to reindex the documents that are currently indexed.
Does anyone have some code for this :) ?

There is a problem either in my code or in ES when doing some
'overwriting' updates and then scrolling. Please see this runnable,
out-of-the-box gist:

Something goes wrong when doing bulkUpdate and retrieving · GitHub

Look at 'showStrangeBehaviour' and set it to true. Why are not all 200
documents returned?

Regards,
Peter.

The problem does not occur with

put("index.number_of_shards", 1)

but it does occur with more than one shard.

Heya,

I will have a look. A note on your usage of XContentBuilder: XContentFactory#jsonBuilder returns a cached, per-thread instance, and you are expected to pass the result to an API before you call it again on the same thread. So, in your case, create each builder and add it directly to a BulkRequestBuilder.

I have just added safeJsonBuilder, which returns a builder that is not cached per thread, but you should prefer the approach above.
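
To make the caveat concrete, here is a minimal sketch of what goes wrong when a per-thread cached buffer is held on to instead of being consumed right away. It is plain Java and does not use Elasticsearch at all: the class CachedBuilderSketch and its methods are hypothetical stand-ins that only mimic the reuse behaviour described above, with a StringBuilder playing the role of the cached builder.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a factory that caches one buffer per thread.
// This is NOT Elasticsearch code; it only mimics the reuse described above.
class CachedBuilderSketch {

    // One StringBuilder per thread, shared by every call to cachedBuilder().
    private static final ThreadLocal<StringBuilder> CACHE =
            ThreadLocal.withInitial(StringBuilder::new);

    // Resets and returns the thread's shared buffer (the "cached" variant).
    static StringBuilder cachedBuilder() {
        StringBuilder sb = CACHE.get();
        sb.setLength(0);
        return sb;
    }

    // Problematic: keeps references to the shared buffer and reads them later.
    static List<String> collectThenRead(int n) {
        List<StringBuilder> pending = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            pending.add(cachedBuilder().append("{\"id\":").append(i).append('}'));
        }
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : pending) {
            out.add(sb.toString()); // all entries are the same buffer: last document wins
        }
        return out;
    }

    // Safe: the content is consumed (copied to a String) before the next call.
    static List<String> consumeImmediately(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(cachedBuilder().append("{\"id\":").append(i).append('}').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectThenRead(3));    // [{"id":2}, {"id":2}, {"id":2}]
        System.out.println(consumeImmediately(3)); // [{"id":0}, {"id":1}, {"id":2}]
    }
}
```

In this model, "consume immediately" corresponds to adding each builder to the bulk request before asking for the next one; an uncached builder avoids the issue by allocating a fresh buffer per call.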

Also, scrolling is not the best way to reindex the data, as it gets pretty expensive as you page deeper through the hit list. I am planning to add a "scan" search type that will not incur this overhead (but will also do no sorting).

-shay.banon

> So, in your case, create and add it directly to a BulkRequestBuilder.

OK. What is so expensive about creating that jsonBuilder?

> safeJsonBuilder

Do you mean this?

XContentBuilder docBuilder =
        JsonXContent.unCachedContentBuilder().startObject();

> Also, scrolling is not the best way to reindex the data

So how would you do this at the moment? Would you use a simple search
(e.g. with a date filter) instead of scrolling?

Regards,
Peter.

On Wednesday, February 9, 2011 at 12:07 AM, Karussell wrote:

> > So, in your case, create and add it directly to a BulkRequestBuilder.
>
> OK. What is so expensive about creating that jsonBuilder?

It's not a question of expense: the byte buffer behind the builder is reused through a thread local to reduce byte copying.

> > safeJsonBuilder
>
> Do you mean this?
>
> XContentBuilder docBuilder =
>         JsonXContent.unCachedContentBuilder().startObject();

Yes, that one exists as well. I added safeJsonBuilder, which delegates to it.

> > Also, scrolling is not the best way to reindex the data
>
> So how would you do this at the moment? Would you use a simple search
> (e.g. with a date filter) instead of scrolling?

You will have the same problem with a regular search that paginates using from / size. That's why the scan type is needed. The overhead does not come from the scrolling implementation, but from how a search is executed when it needs to do a sort in Lucene, and when it is distributed.

Hi Shay,

Thanks for your answers! Do you know where the flaw could come from? I
would like to investigate/fix it, as I'm unsure whether reindexing works
even with this flaw ...

> You will have the same problem with search and doing pagination using from / size

Does that mean that if I want the docs for pages A..B, ES has to use
memory for 0..B?

Could I overcome this problem with additional filter queries if I know
how to split up my data, e.g. based on date?

Would the scan type be without scoring (I don't need scoring)?

Regards,
Peter.

Hmm, when I do an 'implicit search' (over _all?), without naming an index, it works correctly:

// for (String fromIndex : indexList) {
String fromIndex = "";
SearchResponse rsp = client.prepareSearch().
...

Yes, memory (at least the priority queue used by Lucene) will be used for 0..B. The scan type will be without scoring, yes, and is mainly aimed at things like scanning all the hits that match a specific query.
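
As a rough illustration of why a deep page costs memory for 0..B, the sketch below models a sorted, distributed search in plain Java. It is a simplified model under stated assumptions, not Elasticsearch or Lucene code: each "shard" is just a list of integer scores, and DeepPagingSketch, topK, page and candidatesHeld are hypothetical names. The point it shows is that every shard has to return its best from + size hits before the requested page can be cut out of the merged ranking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative model only: a "shard" is a list of scores, not a real index.
class DeepPagingSketch {

    // Best k scores of one shard, kept in a bounded min-heap (a priority queue of size k).
    static List<Integer> topK(List<Integer> shard, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest score on top
        for (int score : shard) {
            heap.add(score);
            if (heap.size() > k) {
                heap.poll(); // evict the current worst candidate
            }
        }
        List<Integer> best = new ArrayList<>(heap);
        best.sort(Collections.reverseOrder());
        return best;
    }

    // Hits [from, from + size) of the global ranking, highest score first.
    static List<Integer> page(List<List<Integer>> shards, int from, int size) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            // Any shard could own the whole page, so each one supplies from + size hits.
            merged.addAll(topK(shard, from + size));
        }
        merged.sort(Collections.reverseOrder());
        int start = Math.min(from, merged.size());
        int end = Math.min(from + size, merged.size());
        return new ArrayList<>(merged.subList(start, end));
    }

    // Number of candidates that must be held to produce one page.
    static int candidatesHeld(List<List<Integer>> shards, int from, int size) {
        int total = 0;
        for (List<Integer> shard : shards) {
            total += Math.min(shard.size(), from + size);
        }
        return total;
    }
}
```

With two shards holding the scores 9, 7, 5, 3, 1 and 10, 8, 6, 4, 2, asking for from = 0, size = 2 keeps 4 candidates, while from = 4, size = 2 keeps all 10 even though only two hits (6 and 5) are returned. The held set grows with the page offset, which matches the behaviour described above for sorted searches.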

I am not sure why it works when you do not specify the index but fails when you do. It should not matter when working with a single index. A good place to start in order to solve it is to take your code and turn it into a test case in ES that recreates the problem. Check SearchScrollTests in the integration module; you can add a test there.

-shay.banon

I opened this issue:

with the integration test - still failing:

OK, I found a workaround/fix; see the issue.