Work-around for the scroll issue doesn't seem to work properly (also)

Hi,
I have an additional question regarding updates for a given query. We
are trying to build a concurrent/multithreaded update solution where
any user can update any result set in a bulk update (for any given
query). When scroll is used (with sorting applied) and two scroll
updater threads try to operate on common documents, one thread
doesn't update all of its documents.
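
To make the setup concrete, each updater thread currently walks its
(sorted) result set in pages and re-indexes every hit, roughly like
this (a simplified sketch, shown with the from/size work-around rather
than scroll, and assuming the same static imports as in the
reconstruction code below; the sort field and page size are just
examples):

for (int from = 0; from < 500; from += 50) { // 500 = total matching docs
    SearchResponse response = client.prepareSearch("twitter")
        .setTypes("tweet")
        .setSearchType(SearchType.QUERY_THEN_FETCH)
        .setQuery(queryString("user:kimchy"))
        .addSort("postDate", SortOrder.ASC) // sorting applied, as described
        .setFrom(from)
        .setSize(50)
        .execute()
        .actionGet();

    for (SearchHit searchHit : response.hits().hits()) {
        Map<String, Object> map = searchHit.sourceAsMap();
        map.put("message", "updated by this thread"); // each thread's own change
        client.index(indexRequest("twitter").type("tweet")
            .id(searchHit.getId()).source(map)).actionGet();
    }
}

When two such threads run over overlapping result sets, the pages shift
underneath them, so some docs get processed twice and others not at all.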

I have a question regarding the proposed scenario (from your response):

"... start from the back and more forward (while updating the timestamp
as well). Note that you will need to refresh after each bulk indexing to
"see" the latest updates"

Won't this approach also have problems in a multithreaded/multiuser
environment where multiple users can issue concurrent update commands
on common documents? For example:

one update thread updates a document's timestamp, and another update
thread then doesn't consider that document for its own (different)
update, because the document has recently been touched.

What would be the best/recommended approach for large concurrent
updates (any good ideas :)?
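
In other words, if each bulk job guards itself with its own start time,
for example (again just a sketch, with the same hypothetical
'timestamp' field as above):

// only touch docs that have not been stamped since this job started
long started = System.currentTimeMillis();
SearchResponse response = client.prepareSearch("twitter")
    .setTypes("tweet")
    .setQuery(queryString("user:kimchy AND timestamp:[* TO " + started + "]"))
    .addSort("timestamp", SortOrder.ASC)
    .setSize(50)
    .execute()
    .actionGet();

then a doc that thread A has just stamped (as part of a different
update) becomes invisible to thread B, even though B never applied its
own change to it.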

Tomislav

On Wed, 2010-11-24 at 12:53 +0200, Shay Banon wrote:

Let me try and explain again: when you do the to/from walking with a
query, you might actually get back docs that you have already updated
and update them again. You can either filter out the docs you want
based on the updated value, or use timestamps as I suggested before to
make sure you only update the docs you want.

On Wed, Nov 24, 2010 at 10:20 AM, Tomislav Poljak <tpoljak@gmail.com>
wrote:
    Hi,
    this is not a full re-indexing; we only need to be able to update
    all documents matched by some arbitrary "field value" query.

    For example, update all docs where user is kimchy -> if
    queryString("user:kimchy") is used instead of *:* in the bug
    reconstruction example, the update still doesn't update all matching
    documents:
    
    curl -XGET
    'http://localhost:9200/twitter/tweet/_count?q=message:updated'
    
    {"count":370,"_shards":{"total":5,"successful":5,"failed":0}}
    
    Can you please advise what is currently the recommended/working way
    to update all docs where 'user' is 'kimchy'?

    Tomislav

    On Tue, 2010-11-23 at 23:35 +0200, Shay Banon wrote:
    > One more option, if you are doing a full reindexing, is to reindex
    > into a fresh index. This will be much faster since there won't be
    > any need to handle deletes and expunge them later on from the index.
    >
    > On Tue, Nov 23, 2010 at 11:34 PM, Shay Banon
    > <shay.banon@elasticsearch.com> wrote:
    >         Hi,
    >
    >
    >         The problem you have is the fact that there is no
    >         ordering guaranteed when doing a match all query. What you
    >         would want to do is introduce some sort of ordering
    >         (timestamp for example). Then, you have two options; either
    >         start from the back and move forward (while updating the
    >         timestamp as well). Note that you will need to refresh
    >         after each bulk indexing to "see" the latest updates.
    >
    >
    >         -shay.banon
    >
    >         On Tue, Nov 23, 2010 at 6:05 PM, Tomislav Poljak
    >         <tpoljak@gmail.com> wrote:
    >                 Hi,
    >                 since scrolling is still broken (in all latest
    >                 versions, 0.13 included)
    >
    >                 https://github.com/elasticsearch/elasticsearch/issues#issue/136
    >
    >                 the 'from' parameter work-around seems to be the
    >                 recommended way to go:
    >
    >                 http://elasticsearch-users.115913.n3.nabble.com/totalHits-gets-changed-unexpectedly-while-scrolling-SearchResponse-td1575408.html
    >
    >                 But it seems (to me) that the 'from' parameter
    >                 work-around has similar/same problems when updating
    >                 documents matched across multiple shards.
    >
    >                 Here is the bug reconstruction:
    >
    >                 1. Index 500 docs (default index settings, 5 shards)
    >
    >                 for (int i = 0; i < 500; i++) {
    >                     IndexResponse response = client
    >                         .prepareIndex("twitter", "tweet", Integer.toString(i))
    >                         .setSource(jsonBuilder()
    >                             .startObject()
    >                                 .field("user", "kimchy")
    >                                 .field("postDate", new Date())
    >                                 .field("message", "trying out Elastic Search")
    >                             .endObject())
    >                         .execute()
    >                         .actionGet();
    >                 }
    >
    >                 2. Check number of docs in index for query *:*
    >
    >                 curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
    >
    >                 {"count":500,"_shards":{"total":5,"successful":5,"failed":0}}
    >
    >                 3. Update all docs (docs matched by *:*) in chunks
    >                 of 50 docs
    >
    >                 for (int from = 0; from < 500; from += 50) {
    >                     SearchResponse response = client.prepareSearch("twitter")
    >                         .setTypes("tweet")
    >                         .setSearchType(SearchType.QUERY_THEN_FETCH)
    >                         .setQuery(queryString("*:*"))
    >                         .setFrom(from)
    >                         .setSize(50)
    >                         .execute()
    >                         .actionGet();
    >
    >                     for (SearchHit searchHit : response.hits().hits()) {
    >                         Map<String, Object> map = searchHit.sourceAsMap();
    >                         map.put("message", "updated");
    >                         client.index(indexRequest("twitter")
    >                             .type("tweet")
    >                             .id(searchHit.getId())
    >                             .source(map)).actionGet();
    >                     }
    >                 }
    >
    >                 4. Check the number of updated docs:
    >
    >                 curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
    >
    >                 {"count":397,"_shards":{"total":5,"successful":5,"failed":0}}
    >
    >                 As you can see, only 397 documents got updated.
    >
    >                 It seems 500 updates do occur, but some docs are
    >                 matched and updated twice while others never get
    >                 updated. I've added debug code to the update loop:
    >
    >                 Set<String> updatedDocs = new HashSet<String>();
    >
    >                 for (int from = 0; from < 500; from += 50) {
    >                     SearchResponse response = client.prepareSearch("twitter")
    >                         .setTypes("tweet")
    >                         .setSearchType(SearchType.QUERY_THEN_FETCH)
    >                         .setQuery(queryString("*:*"))
    >                         .setFrom(from)
    >                         .setSize(50)
    >                         .execute()
    >                         .actionGet();
    >
    >                     for (SearchHit searchHit : response.hits().hits()) {
    >                         Map<String, Object> map = searchHit.sourceAsMap();
    >                         map.put("message", "updated");
    >                         client.index(indexRequest("twitter")
    >                             .type("tweet")
    >                             .id(searchHit.getId())
    >                             .source(map)).actionGet();
    >
    >                         // debug
    >                         if (updatedDocs.contains(searchHit.getId())) {
    >                             System.out.println("Already updated doc, ID: "
    >                                 + searchHit.getId());
    >                         } else {
    >                             updatedDocs.add(searchHit.getId());
    >                         }
    >                     }
    >                 }
    >
    >                 and got duplicates in the output:
    >
    >                 Already updated doc, ID: 405
    >                 Already updated doc, ID: 406
    >                 Already updated doc, ID: 414
    >                 Already updated doc, ID: 413
    >                 Already updated doc, ID: 412
    >                 Already updated doc, ID: 419
    >
    >                 Also, one interesting thing is that the number of
    >                 actually updated documents seems to change from run
    >                 to run for the same code. For the example above: the
    >                 first time only 300 docs got updated, the second
    >                 time 397, and on the last try 350 (I deleted the
    >                 local data/* and re-indexed between update tests).
    >
    >                 Tomislav