Bug when scrolling?

Karussell1 · February 8, 2011, 8:27pm

Hi,

I simply want a way to reindex the docs I currently have indexed.
Does anyone have some code for this?

There is a problem either in my code or in ES when doing some
'overwriting' updates and then scrolling. Please see this
out-of-the-box runnable gist:

https://gist.github.com/karussell/817094

TestES.java

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import java.util.Collection;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.common.unit.TimeValue;
import java.util.LinkedHashMap;
import java.util.Map.Entry;

(Preview truncated here; the full file is in the gist.)

Look at 'showStrangeBehaviour' and set it to true. Why aren't all 200
documents returned?

Regards,
Peter.

Karussell1 · February 8, 2011, 8:29pm

the problem does not occur for

put("index.number_of_shards", 1)

but does for shards>1

kimchy · February 8, 2011, 9:44pm

Heya,

I will have a look. A note on your usage of XContentBuilder: when using XContentFactory#jsonBuilder, a cached version is returned, and it is expected that you will pass the result to an API before you call it again on the same thread. So, in your case, create and add it directly to a BulkRequestBuilder.

I have just added safeJsonBuilder which will return one that is not cached per thread, but you should use the above.
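
To make the caching pitfall concrete, here is a minimal plain-Java sketch. It is not Elasticsearch code: `CachedBuilderSketch`, `cachedBuilder` and `safeBuilder` are hypothetical stand-ins, and a `StringBuilder` plays the role of the reused byte buffer. It only illustrates why a result from a per-thread cached builder should be consumed before the builder is requested again on the same thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a per-thread cached builder; not the Elasticsearch class. */
final class CachedBuilderSketch {
    // One reusable buffer per thread, mirroring the idea of a thread-local cache.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    /** Cached variant: resets and returns the SAME buffer on every call from a thread. */
    static StringBuilder cachedBuilder() {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0);
        return sb;
    }

    /** Uncached variant: a fresh buffer per call, at the cost of an extra allocation. */
    static StringBuilder safeBuilder() {
        return new StringBuilder();
    }

    /** Problematic: two cached builders are held at once, so the second call wipes the first. */
    static List<String> buildAllThenConsume() {
        StringBuilder first = cachedBuilder().append("{\"id\":1}");
        StringBuilder second = cachedBuilder().append("{\"id\":2}");
        List<String> out = new ArrayList<>();
        out.add(first.toString());
        out.add(second.toString());
        return out;
    }

    /** Safe: each result is consumed (copied out) before the builder is requested again. */
    static List<String> buildAndConsumeOneByOne() {
        List<String> out = new ArrayList<>();
        out.add(cachedBuilder().append("{\"id\":1}").toString());
        out.add(cachedBuilder().append("{\"id\":2}").toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(buildAllThenConsume());     // both entries hold the second document
        System.out.println(buildAndConsumeOneByOne()); // each entry holds its own document
    }
}
```

In this sketch the first method returns the second document twice, because both variables point at the same buffer. Handing each built document to its consumer straight away, as in the second method, avoids that without giving up the buffer reuse.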

Also, scrolling is not the best way to reindex the data, as it can get pretty expensive as you page through the hit list. I am planning to add a "scan" type search that will not incur this overhead (but will also do no sorting).

-shay.banon

Karussell1 · February 8, 2011, 10:07pm

> So, in your case, create and add it directly to a BulkRequestBuilder.

OK. What's so expensive about creating that jsonBuilder?

> safeJsonBuilder

Do you mean this?
XContentBuilder docBuilder =
JsonXContent.unCachedContentBuilder().startObject();

> Also, scrolling is not the best way to reindex the data

So how would you do this at the moment? Would you use a simple search
(e.g. with a date filter) instead of scrolling?

Regards,
Peter.

kimchy · February 8, 2011, 10:23pm

On Wednesday, February 9, 2011 at 12:07 AM, Karussell wrote:
>> So, in your case, create and add it directly to a BulkRequestBuilder.
>
> ok. what's so expensive to create that jsonBuilder?

It's not a question of expense; the byte buffer behind it is reused through a thread local to reduce byte copying.

>> safeJsonBuilder
>
> do you mean?
> XContentBuilder docBuilder =
> JsonXContent.unCachedContentBuilder().startObject();

Yea, that is there as well. I added safeJsonBuilder that delegates to that one.

>> Also, scrolling is not the best way to reindex the data
>
> so, how would you do this at the moment. would you use simple search
> (e.g. with a date filter) instead scrolling?

You will have the same problem with search and doing pagination using from / size. That's why the scan type is needed. The overhead does not come from the scrolling implementation, but from how search is executed when it needs to do a sort in Lucene, and when it's distributed.

Karussell1 · February 9, 2011, 5:06pm

Hi Shay,

thanks for your answers! Do you know where the flaw could come from? I
would like to investigate/fix it as I'm unsure if reindexing works
even with this flaw ...

> you will have the same problem with search and doing pagination using from / size

Does that mean that if I want docs from pages A..B, ES has to use
memory for 0..B?

Could I overcome this problem with additional filter queries if I know
how to split up my data - e.g. based on date?

Would the scan type be without scoring (I don't need that)?

Regards,
Peter.

Karussell1 · February 9, 2011, 6:47pm

hmmh, when I'm doing an 'implicit search' (_all??) it works correctly:

// for (String fromIndex : indexList) {
String fromIndex = "";
SearchResponse rsp = client.prepareSearch().
...

kimchy · February 9, 2011, 7:36pm

Yes, memory (at least the priority queue used by Lucene) will be used for 0...B. The scan type will be without scoring, yes, and is mainly aimed at things like scanning all the hits that match a specific query.
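
As a rough illustration of the 0...B cost, here is a simplified plain-Java model of sorted from / size paging across shards. It is a sketch under stated assumptions, not how Elasticsearch or Lucene is actually implemented: `DeepPagingSketch` and its methods are hypothetical, scores are plain integers, and each shard is just a list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical model of sorted from/size paging across shards; not Elasticsearch code. */
final class DeepPagingSketch {

    /** Each shard keeps its best (from + size) scores in a bounded min-heap. */
    static List<Integer> shardTop(List<Integer> shardScores, int from, int size) {
        int k = from + size;
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest kept score on top
        for (int score : shardScores) {
            heap.offer(score);
            if (heap.size() > k) {
                heap.poll(); // evict the current worst candidate
            }
        }
        return new ArrayList<>(heap);
    }

    /** Coordinator: merge every shard's candidates, sort descending, skip the first `from`. */
    static List<Integer> page(List<List<Integer>> shards, int from, int size) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> shard : shards) {
            merged.addAll(shardTop(shard, from, size));
        }
        merged.sort(Collections.reverseOrder());
        int start = Math.min(from, merged.size());
        int end = Math.min(from + size, merged.size());
        return new ArrayList<>(merged.subList(start, end));
    }

    /** Number of candidates the coordinator has to hold for a single page request. */
    static int candidatesHeld(List<List<Integer>> shards, int from, int size) {
        int total = 0;
        for (List<Integer> shard : shards) {
            total += Math.min(shard.size(), from + size);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> shards = new ArrayList<>();
        shards.add(List.of(9, 7, 5, 3, 1));
        shards.add(List.of(10, 8, 6, 4, 2));
        System.out.println(page(shards, 0, 2) + " held=" + candidatesHeld(shards, 0, 2));
        System.out.println(page(shards, 4, 2) + " held=" + candidatesHeld(shards, 4, 2));
    }
}
```

In this model the page size stays at 2, yet the number of candidates held grows with `from`, because no shard can know in advance which of its hits will land in the requested window. A scan that does no sorting would not need the heap at all.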

I am not sure why it works when you don't specify the index but fails when you do; it should not matter when working with a single index. A good place to start solving it is to turn your code into a test case in ES that recreates the problem. Check SearchScrollTests in the integration module; you can add a test there.

-shay.banon

Karussell1 · February 9, 2011, 8:47pm

I opened this issue:

with the integration test - still failing:

https://gist.github.com/karussell/819239

ScrollTests.java

// this GIST is used for this ticket: https://gist.github.com/819239
// SOLVED when sort against id !

package org.elasticsearch.test.integration.search.scroll;

import org.elasticsearch.action.WriteConsistencyLevel;
import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.count.CountResponse;
import org.elasticsearch.action.search.SearchResponse;

(Preview truncated here; the full file is in the gist.)

Karussell1 · February 9, 2011, 9:35pm

ok, found workaround/fix. see issue
