Work-around for the scroll issue doesn't seem to work properly (also)

Hi,
this is not a full re-indexing; we only need to be able to update all
documents matched by an arbitrary "field value" query.

For example, update all docs where user is kimchy. If
queryString("user:kimchy") is used instead of queryString("*:*") in the
bug reconstruction example, the update still doesn't reach all matching
documents:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":370,"_shards":{"total":5,"successful":5,"failed":0}}

Can you please advise what is currently the recommended (working) way to
update all docs where 'user' is 'kimchy'?

Tomislav

On Tue, 2010-11-23 at 23:35 +0200, Shay Banon wrote:

One more option, if you are doing a full reindexing, is to reindex
into a fresh index. This will be much faster since there won't be any
need to handle deletes and expunge them later on from the index.

On Tue, Nov 23, 2010 at 11:34 PM, Shay Banon
<shay.banon@elasticsearch.com> wrote:
Hi,

    The problem you have is that no ordering is guaranteed when
    doing a match all query. What you would want to do is introduce
    some sort of ordering (a timestamp, for example). Then, you have
    two options, either start from the back and move forward (while
    updating the timestamp as well). Note that you will need to
    refresh after each bulk indexing to "see" the latest updates.
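
To illustrate the point about ordering, here is a toy, self-contained simulation. It uses plain Java collections, not the Elasticsearch client; the class name SortedPagingSketch, the method updateInPages, and the "reshuffle before every page" model are assumptions made purely for illustration. It also sorts on a key that updates do not change (the document id), which is a simplification of the timestamp approach described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Toy in-memory stand-in for an index; NOT the Elasticsearch client API.
class SortedPagingSketch {

    // Pages through ids [0, total) in chunks of pageSize and returns the set
    // of ids that were "updated". When sorted is false, the result order is
    // reshuffled before every page to mimic a query without guaranteed order.
    static Set<Integer> updateInPages(int total, int pageSize, boolean sorted, long seed) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            ids.add(i);
        }
        Random random = new Random(seed);
        Set<Integer> updated = new HashSet<>();
        for (int from = 0; from < total; from += pageSize) {
            if (sorted) {
                ids.sort(Comparator.naturalOrder()); // stable, repeatable order
            } else {
                Collections.shuffle(ids, random);    // order changes between pages
            }
            int to = Math.min(from + pageSize, total);
            updated.addAll(ids.subList(from, to));   // "update" this page
        }
        return updated;
    }

    public static void main(String[] args) {
        System.out.println("sorted:   " + updateInPages(500, 50, true, 42L).size());
        System.out.println("unsorted: " + updateInPages(500, 50, false, 42L).size());
    }
}
```

With a stable order every id lands on exactly one page, so all 500 are covered. Once the order can change between pages, some ids show up on several pages and others on none, which matches the duplicates and shortfall reported below.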
    
    
    -shay.banon
    
    
    
    On Tue, Nov 23, 2010 at 6:05 PM, Tomislav Poljak
    <tpoljak@gmail.com> wrote:
            Hi,
            since scrolling is still broken (in all the latest
            versions, 0.13 included)
            
            https://github.com/elasticsearch/elasticsearch/issues#issue/136
            
            the 'from' parameter work-around seems to be the
            recommended way to go
            
            http://elasticsearch-users.115913.n3.nabble.com/totalHits-gets-changed-unexpectedly-while-scrolling-SearchResponse-td1575408.html
            
            But it seems (to me) that the 'from' parameter work-around
            has the same or similar problems when updating documents
            matched across multiple shards.
            
            Here is the bug reconstruction:
            
            1. Index 500 docs (default index settings, 5 shards)
            
            for (int i = 0; i < 500; i++) {
                IndexResponse response = client.prepareIndex("twitter", "tweet", Integer.toString(i))
                        .setSource(jsonBuilder()
                                .startObject()
                                .field("user", "kimchy")
                                .field("postDate", new Date())
                                .field("message", "trying out Elastic Search")
                                .endObject())
                        .execute()
                        .actionGet();
            }
            
            
            2. Check number of docs in index for query *:*
            
            curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
            {"count":500,"_shards":{"total":5,"successful":5,"failed":0}}
            
            
            
            3. Update all docs (docs matched by *:*) in chunks of
            50 docs
            
            for (int from = 0; from < 500; from += 50) {
                SearchResponse response = client.prepareSearch("twitter")
                        .setTypes("tweet")
                        .setSearchType(SearchType.QUERY_THEN_FETCH)
                        .setQuery(queryString("*:*"))
                        .setFrom(from)
                        .setSize(50)
                        .execute()
                        .actionGet();

                for (SearchHit searchHit : response.hits().hits()) {
                    Map<String, Object> map = searchHit.sourceAsMap();
                    map.put("message", "updated");
                    client.index(indexRequest("twitter")
                            .type("tweet")
                            .id(searchHit.getId())
                            .source(map)).actionGet();
                }
            }
            
            
            4. Check the number of updated docs:
            
            curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
            {"count":397,"_shards":{"total":5,"successful":5,"failed":0}}
            
            
            As you can see, only 397 documents got updated.

            It seems 500 updates do occur, but some docs are matched
            and updated twice and others never get updated. I've added
            debug code to the update loop:
            
            Set<String> updatedDocs = new HashSet<String>();

            for (int from = 0; from < 500; from += 50) {
                SearchResponse response = client.prepareSearch("twitter")
                        .setTypes("tweet")
                        .setSearchType(SearchType.QUERY_THEN_FETCH)
                        .setQuery(queryString("*:*"))
                        .setFrom(from)
                        .setSize(50)
                        .execute()
                        .actionGet();

                for (SearchHit searchHit : response.hits().hits()) {
                    Map<String, Object> map = searchHit.sourceAsMap();
                    map.put("message", "updated");
                    client.index(indexRequest("twitter")
                            .type("tweet")
                            .id(searchHit.getId())
                            .source(map)).actionGet();

                    // debug
                    if (updatedDocs.contains(searchHit.getId())) {
                        System.out.println("Already updated doc, ID: " + searchHit.getId());
                    } else {
                        updatedDocs.add(searchHit.getId());
                    }
                }
            }
            
            and got duplicates on output:
            
            Already updated doc, ID: 405
            Already updated doc, ID: 406
            Already updated doc, ID: 414
            Already updated doc, ID: 413
            Already updated doc, ID: 412
            Already updated doc, ID: 419
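
The debug output above shows only the docs that were hit twice. Since step 1 indexes ids "0" through "499", the docs that were never hit can be listed from the same updatedDocs set. The following is a minimal sketch under that assumption; the class name MissedDocsSketch and the helper missedIds are hypothetical, and the sample data in main is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissedDocsSketch {

    // Returns the ids in [0, total) that are absent from updatedDocs,
    // in ascending numeric order. Assumes ids are Integer.toString(i).
    static List<String> missedIds(Set<String> updatedDocs, int total) {
        List<String> missed = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            String id = Integer.toString(i);
            if (!updatedDocs.contains(id)) {
                missed.add(id);
            }
        }
        return missed;
    }

    public static void main(String[] args) {
        // Hypothetical sample: pretend only ids 0, 1 and 3 out of 5 were updated.
        Set<String> updatedDocs = new HashSet<>(Arrays.asList("0", "1", "3"));
        System.out.println("Never updated: " + missedIds(updatedDocs, 5));
        // prints "Never updated: [2, 4]"
    }
}
```

Calling missedIds(updatedDocs, 500) after the update loop would give the ids to compare against the _count result from step 4.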
            
            Also, one interesting thing is that the number of actually
            updated documents seems to change from run to run for the
            same code. For the example above: the first time only 300
            got updated, the second time 397 docs were updated, and on
            the last try 350 docs (I deleted local data/* and
            re-indexed between update tests).
            
            
            Tomislav