Hi,

this is not a full re-indexing; we only need to be able to update all
documents matched by an arbitrary "field value" query.
For example, update all docs where user is kimchy. If
queryString("user:kimchy") is used instead of queryString("*:*") in the
bug reconstruction example, the update still doesn't reach all matching
documents:
curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":370,"_shards":{"total":5,"successful":5,"failed":0}}
Can you please advise what the currently recommended (working) way is to
update all docs where 'user' is 'kimchy'?
Tomislav
On Tue, 2010-11-23 at 23:35 +0200, Shay Banon wrote:
One more option, if you are doing a full reindexing, is to reindex
into a fresh index. This will be much faster since there won't be any
need to handle deletes and expunge them later on from the index.

On Tue, Nov 23, 2010 at 11:34 PM, Shay Banon
shay.banon@elasticsearch.com wrote:
Hi,

The problem you have is the fact that there is no ordering guaranteed
when doing a match all query. What you would want to do is introduce
some sort of ordering (timestamp for example). Then, you have two
options, either start from the back and move forward (while updating
the timestamp as well). Note that you will need to refresh after each
bulk indexing to "see" the latest updates.

-shay.banon

On Tue, Nov 23, 2010 at 6:05 PM, Tomislav Poljak <tpoljak@gmail.com> wrote:

Hi,

since scrolling is still broken (in all latest versions, 0.13 included)

https://github.com/elasticsearch/elasticsearch/issues#issue/136

the 'from' parameter work-around seems to be the recommended way to go

http://elasticsearch-users.115913.n3.nabble.com/totalHits-gets-changed-unexpectedly-while-scrolling-SearchResponse-td1575408.html

But it seems (to me) that the 'from' parameter work-around has
similar/same problems when updating documents matched over multiple
shards. Here is the bug reconstruction:

1. Index 500 docs (default index settings, 5 shards)

    for (int i = 0; i < 500; i++) {
        IndexResponse response = client.prepareIndex("twitter", "tweet", Integer.toString(i))
            .setSource(jsonBuilder()
                .startObject()
                    .field("user", "kimchy")
                    .field("postDate", new Date())
                    .field("message", "trying out Elastic Search")
                .endObject()
            )
            .execute()
            .actionGet();
    }

2. Check number of docs in index for query *:*

    curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
    {"count":500,"_shards":{"total":5,"successful":5,"failed":0}}

3. Update all docs (docs matched by *:*) in chunks of 50 docs

    for (int from = 0; from < 500; from += 50) {
        SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();
        for (SearchHit searchHit : response.hits().hits()) {
            Map<String, Object> map = searchHit.sourceAsMap();
            map.put("message", "updated");
            client.index(
                indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map)
            ).actionGet();
        }
    }

4. Check the number of updated docs:

    curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
    {"count":397,"_shards":{"total":5,"successful":5,"failed":0}}

As you can see, only 397 documents got updated. It seems 500 updates do
occur, but some docs are matched and updated twice and others never get
updated. I've added debug code to the update:

    Set<String> updatedDocs = new HashSet<String>();
    for (int from = 0; from < 500; from += 50) {
        SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();
        for (SearchHit searchHit : response.hits().hits()) {
            Map<String, Object> map = searchHit.sourceAsMap();
            map.put("message", "updated");
            client.index(
                indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map)
            ).actionGet();
            //debug
            if (updatedDocs.contains(searchHit.getId())) {
                System.out.println("Already updated doc, ID: " + searchHit.getId());
            } else {
                updatedDocs.add(searchHit.getId());
            }
        }
    }

and got duplicates on output:

    Already updated doc, ID: 405
    Already updated doc, ID: 406
    Already updated doc, ID: 414
    Already updated doc, ID: 413
    Already updated doc, ID: 412
    Already updated doc, ID: 419

Also, one interesting thing is that the number of actually updated
documents seems to change from run to run for the same code. For the
example above: the first time only 300 got updated, the second time 397
docs were updated, and on the last try 350 docs (I've deleted local
data/* and re-indexed between update tests).

Tomislav
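
To make the ordering point above concrete, here is a small self-contained sketch that does not talk to Elasticsearch at all. It only simulates, under loud assumptions, why 'from'-based paging can skip and repeat documents when each update changes the result order, and why sorting on a stable key avoids that. Everything in it is hypothetical: the PagingSketch class, the Doc type, and in particular the assumption that a re-indexed document moves to the end of the unordered result list (the real order depends on shards and segments and is not predictable). It also sorts on an immutable id for simplicity, rather than on a timestamp that is updated along the way.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PagingSketch {

    /** Hypothetical in-memory stand-in for an indexed document. */
    static final class Doc {
        final int id;
        String message = "trying out Elastic Search";
        Doc(int id) { this.id = id; }
    }

    /** Builds a toy "index" holding documents with ids 0..n-1. */
    static List<Doc> buildIndex(int n) {
        List<Doc> index = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            index.add(new Doc(i));
        }
        return index;
    }

    /**
     * Simulates re-indexing a document. ASSUMPTION: the old copy is removed
     * and the new copy lands at the end of the unordered result list.
     */
    static void reindex(List<Doc> index, Doc doc) {
        index.remove(doc);
        doc.message = "updated";
        index.add(doc);
    }

    /** Returns a copy of the page [from, from + size) of the given hit list. */
    static List<Doc> page(List<Doc> hits, int from, int size) {
        int start = Math.min(from, hits.size());
        int end = Math.min(from + size, hits.size());
        return new ArrayList<>(hits.subList(start, end));
    }

    /** Pages over the index in whatever order it currently has. */
    static int updateWithUnstableOrder(int total, int chunk) {
        List<Doc> index = buildIndex(total);
        Set<Integer> updated = new HashSet<>();
        for (int from = 0; from < total; from += chunk) {
            for (Doc doc : page(index, from, chunk)) {
                reindex(index, doc);
                updated.add(doc.id);
            }
        }
        return updated.size();
    }

    /** Pages over the index sorted by an immutable key (the id). */
    static int updateWithStableOrder(int total, int chunk) {
        List<Doc> index = buildIndex(total);
        Set<Integer> updated = new HashSet<>();
        for (int from = 0; from < total; from += chunk) {
            // Each "search" re-sorts the current hits, so page boundaries stay fixed.
            List<Doc> sorted = new ArrayList<>(index);
            sorted.sort(Comparator.comparingInt((Doc d) -> d.id));
            for (Doc doc : page(sorted, from, chunk)) {
                reindex(index, doc);
                updated.add(doc.id);
            }
        }
        return updated.size();
    }

    public static void main(String[] args) {
        System.out.println("unstable order: " + updateWithUnstableOrder(500, 50));
        System.out.println("stable order:   " + updateWithStableOrder(500, 50));
    }
}
```

In this toy model the unordered run touches fewer than 500 distinct documents while still issuing 500 updates, which mirrors the duplicates seen in the debug output; the sorted run reaches all 500. The exact counts say nothing about a real cluster. In a real cluster the refresh caveat from the reply above would also apply.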