Work-around for the scroll issue doesn't seem to work properly (either)


(Tomislav Poljak) #1

Hi,
since scrolling is still broken (in all recent versions, 0.13 included)

https://github.com/elasticsearch/elasticsearch/issues#issue/136

the 'from' parameter work-around seems to be the recommended way to go

http://elasticsearch-users.115913.n3.nabble.com/totalHits-gets-changed-unexpectedly-while-scrolling-SearchResponse-td1575408.html

But it seems (to me) that the 'from' parameter work-around has the same
or similar problems when updating documents matched across several shards.

Here is the bug reconstruction:

  1. Index 500 docs (default index settings, 5 shards)

for (int i = 0; i < 500; i++) {
    IndexResponse response = client.prepareIndex("twitter", "tweet", Integer.toString(i))
            .setSource(jsonBuilder()
                    .startObject()
                        .field("user", "kimchy")
                        .field("postDate", new Date())
                        .field("message", "trying out Elastic Search")
                    .endObject()
            )
            .execute()
            .actionGet();
}

  2. Check the number of docs in the index for the query *:*

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
{"count":500,"_shards":{"total":5,"successful":5,"failed":0}}

  3. Update all docs (docs matched by *:*) in chunks of 50 docs

for (int from = 0; from < 500; from += 50) {

    // Fetch the next page of 50 hits; no sort order is specified here.
    SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();

    // Re-index every hit with the "message" field changed.
    for (SearchHit searchHit : response.hits().hits()) {
        Map<String, Object> map = searchHit.sourceAsMap();
        map.put("message", "updated");
        client.index(indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map))
                .actionGet();
    }
}

  4. Check the number of updated docs:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":397,"_shards":{"total":5,"successful":5,"failed":0}}

As you can see, only 397 documents got updated.

It seems 500 updates do occur, but some docs are matched and updated
twice and others never get updated. I've added debug code to the update:

Set<String> updatedDocs = new HashSet<String>();

for (int from = 0; from < 500; from += 50) {

    SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();

    for (SearchHit searchHit : response.hits().hits()) {
        Map<String, Object> map = searchHit.sourceAsMap();
        map.put("message", "updated");
        client.index(indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map))
                .actionGet();

        // debug: report any doc ID that comes back on more than one page
        if (updatedDocs.contains(searchHit.getId())) {
            System.out.println("Already updated doc, ID: " + searchHit.getId());
        } else {
            updatedDocs.add(searchHit.getId());
        }
    }
}

and got duplicates in the output:

Already updated doc, ID: 405
Already updated doc, ID: 406
Already updated doc, ID: 414
Already updated doc, ID: 413
Already updated doc, ID: 412
Already updated doc, ID: 419

Also, one interesting thing is that the number of documents actually
updated changes from run to run for the same code. For the example
above: the first time only 300 got updated, the second time 397, and
on the last try 350 (I deleted local data/* and re-indexed between
update tests).

Tomislav


(Shay Banon) #2

Hi,

The problem you have is that no ordering is guaranteed when doing a
match-all query. What you would want to do is introduce some sort of
ordering (a timestamp, for example). Then, one option is to start from
the back and move forward (while updating the timestamp as well). Note
that you will need to refresh after each bulk indexing to "see" the
latest updates.

-shay.banon
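
The effect of paging without a guaranteed order can be reproduced
without a cluster. What follows is only a minimal sketch in plain Java,
not the Elasticsearch client API: the class PagingDrift, the method
distinctUpdated, and above all the rule that an updated doc moves to
the end of the unsorted result order are assumptions made for
illustration. It compares 'from'-style paging over an order that shifts
on update with paging over a sort key that the update leaves alone.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PagingDrift {

    /**
     * Pages through `total` docs in chunks of `size`, "updating" each hit,
     * and returns how many distinct docs were updated.
     */
    static int distinctUpdated(int total, int size, boolean stableSort) {
        // Position in this list stands in for the order in which hits are returned.
        List<Integer> order = new ArrayList<>();
        for (int id = 0; id < total; id++) {
            order.add(id);
        }

        Set<Integer> updated = new HashSet<>();
        for (int from = 0; from < total; from += size) {
            // Copy the page first so that reordering below does not disturb the iteration.
            List<Integer> page = new ArrayList<>(order.subList(from, Math.min(from + size, total)));
            for (Integer id : page) {
                updated.add(id);
                if (!stableSort) {
                    // ASSUMPTION: without a sort, a re-indexed doc moves to the end of the order.
                    order.remove(id); // remove(Object): removes the doc with this id
                    order.add(id);
                }
            }
        }
        return updated.size();
    }

    public static void main(String[] args) {
        System.out.println("no stable order:  " + distinctUpdated(500, 50, false) + " distinct docs updated");
        System.out.println("stable sort key:  " + distinctUpdated(500, 50, true) + " distinct docs updated");
    }
}
```

In this model all 500 update calls still happen in both runs, but with
the shifting order some pages return docs that were already updated
while other docs are never returned, which matches the symptom
reported above.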



(Shay Banon) #3

One more option, if you are doing a full reindexing, is to reindex into a
fresh index. This will be much faster since there won't be any need to
handle deletes and expunge them later on from the index.



(Tomislav Poljak) #4

Hi,
this is not a full re-indexing; we only need to be able to update all
documents matched by an arbitrary "field value" query.

For example, update all docs where user is kimchy: if
queryString("user:kimchy") is used instead of *:* in the bug
reconstruction example, the update still doesn't reach all matching
documents:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":370,"_shards":{"total":5,"successful":5,"failed":0}}

Can you please advise what is currently the recommended working way to
update all docs where 'user' is 'kimchy'?

Tomislav


(Shay Banon) #5

Let me try to explain again: when you do the to/from walking with a
query, you might actually get back docs that you already updated and
update them again. You can try to filter out docs based on the updated
value, or use timestamps as I suggested before, to make sure you only
update the docs you want.
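
One way to read the "filter out based on the updated value" suggestion
is sketched below. It is a plain-Java, in-memory stand-in and not the
Elasticsearch client API; the names UpdateUntilDrained,
searchNotUpdated and updateAll are hypothetical. The idea, under those
assumptions, is to always ask for the first page of docs that do not
yet carry the new value, update them, and repeat until the query comes
back empty, so no 'from' offset is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UpdateUntilDrained {

    /** Hypothetical stand-in for a search: ids of up to `size` docs whose message is not yet `target`. */
    static List<String> searchNotUpdated(Map<String, String> messages, String target, int size) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : messages.entrySet()) {
            if (hits.size() == size) {
                break;
            }
            if (!target.equals(entry.getValue())) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    /** Updates every doc to `target` in chunks of `size`; returns the number of non-empty pages processed. */
    static int updateAll(Map<String, String> messages, String target, int size) {
        int rounds = 0;
        while (true) {
            List<String> hits = searchNotUpdated(messages, target, size);
            if (hits.isEmpty()) {
                return rounds;
            }
            for (String id : hits) {
                messages.put(id, target);
            }
            // On a real cluster the index would need a refresh here so that the
            // next search sees these writes (as noted earlier in the thread).
            rounds++;
        }
    }

    public static void main(String[] args) {
        Map<String, String> messages = new LinkedHashMap<>();
        for (int i = 0; i < 500; i++) {
            messages.put(Integer.toString(i), "trying out Elastic Search");
        }
        int rounds = updateAll(messages, "updated", 50);
        System.out.println("pages processed: " + rounds);
        System.out.println("docs left with old value: "
                + searchNotUpdated(messages, "updated", Integer.MAX_VALUE).size());
    }
}
```

The loop terminates only because each update removes the doc from the
next result set; an update that does not change the filtered value
would need a different marker, such as the timestamp mentioned above.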


(Tomislav Poljak) #6

Hi,
I have an additional question regarding updates for a given query. We
are actually trying to build a concurrent/multithreaded update solution
where any user can update any result set in a bulk update (for any
given query). When scroll is used (with sorting applied) and two scroll
updater threads operate on common documents, one thread doesn't update
all documents.

I have a question regarding the scenario proposed in your response:

"... start from the back and move forward (while updating the timestamp
as well). Note that you will need to refresh after each bulk indexing to
"see" the latest updates"

Won't this approach also have problems in a multithreaded/multiuser
environment where multiple users can issue concurrent update commands on
common documents? For example:

one update thread updates a document's timestamp, and another update
thread doesn't consider that document for its own (different) update
because it has recently been touched.

What would be the best/recommended approach for large concurrent
updates (any good ideas :)?

Tomislav
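
The example scenario in this post can be made concrete with a small
simulation. This is only a sketch in plain Java under stated
assumptions, not Elasticsearch code: the class SharedTimestampFilter,
the Doc fields, and the "touched before cutoff" filter are all
hypothetical. Two different updates share one "last touched" timestamp;
the second runs after the first's writes are visible and therefore
skips every document.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

class SharedTimestampFilter {

    /** Hypothetical document: two unrelated fields plus one shared "last touched" timestamp. */
    static final class Doc {
        String message = "trying out Elastic Search";
        String category = "none";
        long touchedAt = 0L;
    }

    /** Builds an in-memory "index" of `count` untouched docs keyed by id. */
    static Map<String, Doc> newIndex(int count) {
        Map<String, Doc> index = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            index.put(Integer.toString(i), new Doc());
        }
        return index;
    }

    /**
     * Applies `change` to every doc last touched before `cutoff`, stamps it
     * with `now`, and returns how many docs were changed.
     */
    static int updateUntouched(Map<String, Doc> index, long cutoff, long now, Consumer<Doc> change) {
        int changed = 0;
        for (Doc doc : index.values()) {
            if (doc.touchedAt < cutoff) {
                change.accept(doc);
                doc.touchedAt = now;
                changed++;
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Doc> index = newIndex(5);
        long cutoff = 100L; // both updaters were started "at time 100"

        // Updater A changes `message` and bumps the shared timestamp.
        int a = updateUntouched(index, cutoff, 100L, doc -> doc.message = "updated");
        // Updater B wants a different change, but filters on the same timestamp.
        int b = updateUntouched(index, cutoff, 101L, doc -> doc.category = "archived");

        System.out.println("updater A changed " + a + " docs, updater B changed " + b + " docs");
    }
}
```

In this model the second update is lost entirely because both updates
rely on the same marker; a marker per update, or a filter on the field
being changed, would avoid that particular collision, though the thread
itself does not settle on an approach.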


(Tomislav Poljak) #7

Hi Shay,
maybe I didn't formulate my question well in my previous post.

I understand that the scroll feature (which works like a cursor) cannot
cope well with concurrent (multithreaded) updates on the same document
set (one scroll doesn't 'see' another scroll's changes), and I understand
that synchronizing concurrent updates on the same data is always
problematic. But since ES is a near-real-time search engine, a discussion
of best practices for handling concurrent (possibly large or bulk)
updates makes sense (from my point of view).

I've been testing updates using from/size iteration on the latest ES
snapshot (master built from source), and it seems that if a sort is
present, all documents get updated (no need to update a timestamp field
and refresh in each iteration). Is this correct?
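
To make the from/size observation concrete, here is a small in-memory simulation (not ES client code; `PagingSketch`, `Doc`, `page` and `walk` are names made up for this sketch). It assumes each page is re-sorted at request time, and it compares walking with a sort key the update leaves alone (`id`) against a sort key the update itself changes (`touched`). It only illustrates the ordering argument and does not claim to reproduce how ES executes a search.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// In-memory stand-in for an index; hypothetical, not part of the ES client.
class PagingSketch {
    static final class Doc {
        final int id;
        long touched; // rewritten by every update
        Doc(int id) { this.id = id; }
    }

    static final Comparator<Doc> BY_ID = Comparator.comparingInt(d -> d.id);
    static final Comparator<Doc> BY_TOUCHED = Comparator.comparingLong(d -> d.touched);

    // Returns one from/size page of the docs, sorted at request time.
    static List<Doc> page(List<Doc> index, Comparator<Doc> order, int from, int size) {
        List<Doc> sorted = new ArrayList<>(index);
        sorted.sort(order);
        int end = Math.min(from + size, sorted.size());
        return new ArrayList<>(sorted.subList(Math.min(from, end), end));
    }

    // Walks the index page by page, "updating" every hit, and returns
    // the number of distinct documents that were visited.
    static int walk(int total, int size, Comparator<Doc> order) {
        List<Doc> index = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            index.add(new Doc(i));
        }
        Set<Integer> seen = new HashSet<>();
        long clock = 0;
        for (int from = 0; from < total; from += size) {
            for (Doc d : page(index, order, from, size)) {
                d.touched = ++clock; // the update changes this field
                seen.add(d.id);
            }
        }
        return seen.size();
    }
}
```

In this simulation `PagingSketch.walk(500, 50, PagingSketch.BY_ID)` visits every document, while the same walk ordered by `BY_TOUCHED` revisits some documents and never reaches others, because each update reshuffles the pages that have not been read yet.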

So, if both the scroll-based approach and the from/size iteration
approach update all docs in a single thread (when a sort is present),
which solution would you recommend for large concurrent/multithreaded
updates:

  1. Scroll-based approach, where the app needs to make sure that the
    parallel scrolling threads do not share any common docs before users
    can execute them as parallel tasks. Maybe something like: if
    count(queryForUpdate1 AND queryForUpdate2 AND ..) = 0, then these can
    run in parallel; otherwise the user needs to wait.

or

  2. Sync updates at the 'update chunk' level in the from/size iteration
    approach: before a thread updates 'size' docs in an update step, it
    puts their docIDs in a 'locked' map. If any of the IDs are already
    present in the 'locked' map, the thread needs to wait for n seconds
    and check again (because this means another thread is updating these
    docs). When the other thread removes its 'lock', the thread will be
    able to continue and update the docs. After each 'size' chunk update,
    the thread will refresh the index (so other threads can work on the
    new/updated data). I think this approach can only work with the
    constraint that concurrent threads do not change each other's sort or
    query field values.
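
The 'locked' map from the second option could look roughly like the sketch below. `ChunkLockRegistry` and its methods are hypothetical names, not part of the ES client, and the sketch assumes all updater threads run inside one JVM; it would not coordinate separate processes. A chunk is locked all-or-nothing, so a thread that loses the race holds no partial locks while it waits.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: an in-process registry of document IDs that some
// updater thread is currently working on.
class ChunkLockRegistry {
    private final Set<String> locked = new HashSet<>();

    // Tries to lock every ID in the chunk. Returns false and locks nothing
    // if any of the IDs is already held.
    synchronized boolean tryLockAll(Collection<String> ids) {
        for (String id : ids) {
            if (locked.contains(id)) {
                return false;
            }
        }
        locked.addAll(ids);
        return true;
    }

    // Blocks until the whole chunk can be locked, polling every waitMillis
    // (the "wait for n seconds and check again" step).
    void lockAll(Collection<String> ids, long waitMillis) throws InterruptedException {
        while (!tryLockAll(ids)) {
            Thread.sleep(waitMillis);
        }
    }

    // Releases the chunk; intended to be called after the chunk has been
    // indexed and the index refreshed.
    synchronized void unlockAll(Collection<String> ids) {
        locked.removeAll(ids);
    }
}
```

An updater thread would call `lockAll` with the IDs of the current page, index the documents, refresh, and then call `unlockAll` in a `finally` block so a failed update does not leave the IDs locked.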

Any help would be appreciated,

Tomislav


(Shay Banon) #8

The documents that will get updated will depend on when you opened the
scrolling search request to start navigating the data. Regarding concurrent
updates, you will need to handle it yourself (and it's not a simple one to
handle..) since ES does not provide things like transactions and locking
(which you might use in database-based systems). Your options are sound;
note that if you get into updates from completely different processes, you
might run into problems with them unless each process only handles a
specific portion of the data.

In the future, ES can provide something similar to optimistic locking
control (per document), and then it should be simpler to implement
something like that, since you will get a failure when trying to update a
document that has already been updated since you last read it.
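
As a rough illustration of what per-document optimistic locking means, here is a sketch built on a hypothetical in-memory `VersionedStore`. It is not an ES API; the class, `updateIfVersion`, and the version numbering are assumptions made for the example. A write succeeds only if the document still has the version the writer last read; otherwise the writer gets a failure and can re-read and retry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory store, used only to illustrate the idea.
class VersionedStore {
    // Immutable snapshot of a document together with its version number.
    static final class Doc {
        final String source;
        final long version;
        Doc(String source, long version) {
            this.source = source;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Doc> docs = new ConcurrentHashMap<>();

    // Returns the current snapshot, or null if the ID is unknown.
    Doc get(String id) {
        return docs.get(id);
    }

    // Stores the document at version 1, overwriting any existing copy.
    void create(String id, String source) {
        docs.put(id, new Doc(source, 1));
    }

    // Writes only if the stored version still equals expectedVersion.
    // Returns false when the document is missing or another writer
    // updated it after it was read.
    boolean updateIfVersion(String id, String newSource, long expectedVersion) {
        Doc current = docs.get(id);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // Atomic compare-and-swap on the snapshot that was just checked.
        return docs.replace(id, current, new Doc(newSource, expectedVersion + 1));
    }
}
```

With something like this, two updaters that read the same version cannot both succeed: the second `updateIfVersion` call returns false instead of silently overwriting the first update.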


(system) #9