Hi,
since scrolling is still broken (in all the latest versions, 0.13 included),
the 'from' parameter workaround seems to be the recommended way to go.
But it seems (to me) that the 'from' parameter workaround has similar or the
same problems when updating documents matched across multiple shards.
Here is how to reproduce the bug:
- Index 500 docs (default index settings, 5 shards):
    for (int i = 0; i < 500; i++) {
        IndexResponse response = client.prepareIndex("twitter", "tweet", Integer.toString(i))
            .setSource(jsonBuilder()
                .startObject()
                    .field("user", "kimchy")
                    .field("postDate", new Date())
                    .field("message", "trying out Elastic Search")
                .endObject()
            )
            .execute()
            .actionGet();
    }
- Check the number of docs in the index for the query *:*
    curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
    {"count":500,"_shards":{"total":5,"successful":5,"failed":0}}
- Update all docs (docs matched by *:*) in chunks of 50:
    for (int from = 0; from < 500; from += 50) {
        // Fetch the next page of 50 hits using from/size paging
        SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();
        for (SearchHit searchHit : response.hits().hits()) {
            // Re-index each hit under the same ID with a changed "message" field
            Map<String, Object> map = searchHit.sourceAsMap();
            map.put("message", "updated");
            client.index(
                indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map)).actionGet();
        }
    }
- Check the number of updated docs:
    curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
    {"count":397,"_shards":{"total":5,"successful":5,"failed":0}}
As you can see, only 397 documents got updated.
It seems 500 updates do occur, but some docs are matched and updated
twice and others never get updated. I've added debug code to the update loop:
    Set<String> updatedDocs = new HashSet<String>();
    for (int from = 0; from < 500; from += 50) {
        SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();
        for (SearchHit searchHit : response.hits().hits()) {
            Map<String, Object> map = searchHit.sourceAsMap();
            map.put("message", "updated");
            client.index(
                indexRequest("twitter").type("tweet").id(searchHit.getId()).source(map)).actionGet();
            // debug: report any ID that has already been updated on an earlier hit
            if (updatedDocs.contains(searchHit.getId())) {
                System.out.println("Already updated doc, ID: " + searchHit.getId());
            } else {
                updatedDocs.add(searchHit.getId());
            }
        }
    }
and got duplicates in the output:
Already updated doc, ID: 405
Already updated doc, ID: 406
Already updated doc, ID: 414
Already updated doc, ID: 413
Already updated doc, ID: 412
Already updated doc, ID: 419
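The debug check above only reports duplicates. To also list the IDs that were never returned by any page, the same bookkeeping can be pulled into a small helper. This is only a sketch: the class name PageAudit and its methods are hypothetical (they are not part of the Elasticsearch client), and it assumes the IDs are the strings "0" to "499" as in the indexing loop.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: tracks which hit IDs were seen, seen twice, or never seen. */
public class PageAudit {
    /** IDs that were recorded more than once. */
    public final Set<String> duplicates = new TreeSet<>();
    private final Set<String> seen = new LinkedHashSet<>();

    /** Call once per search hit, e.g. audit.record(searchHit.getId()). */
    public void record(String id) {
        if (!seen.add(id)) {
            duplicates.add(id);
        }
    }

    /** Number of distinct IDs recorded so far. */
    public int distinctCount() {
        return seen.size();
    }

    /** IDs in "0".."expectedCount-1" that were never recorded. */
    public Set<String> missing(int expectedCount) {
        Set<String> missing = new TreeSet<>();
        for (int i = 0; i < expectedCount; i++) {
            String id = Integer.toString(i);
            if (!seen.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```

In the update loop, calling record() for every hit and printing duplicates and missing(500) at the end would show both sides of the problem in one run.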
Also, one interesting thing is that the number of actually updated documents
seems to change from run to run for the same code. For the example
above: the first time only 300 got updated, the second time 397 docs were
updated, and on the last try 350 docs (I deleted local data/* and
re-indexed between update tests).
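One possible explanation for the duplicates and misses, which I have not confirmed, is that re-indexing a doc changes its position in the result order, so later 'from' offsets point into a list that has shifted. The toy model below is a sketch of that idea only: it uses a single in-memory list instead of 5 shards, and it assumes (my assumption, not verified against Elasticsearch) that an updated doc moves to the end of the match order. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model (not Elasticsearch code): from/size paging over a list that is reordered by updates. */
public class PagingWhileUpdatingModel {

    /** Returns {total updates issued, distinct docs updated}. */
    public static int[] simulate(int docCount, int pageSize) {
        // Current match order; assumption: an updated doc moves to the end.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docCount; i++) {
            order.add(i);
        }
        Set<Integer> updated = new HashSet<>();
        int updates = 0;
        for (int from = 0; from < docCount; from += pageSize) {
            int to = Math.min(from + pageSize, order.size());
            // Snapshot the page before mutating the order, like a search response.
            List<Integer> page = new ArrayList<>(order.subList(from, to));
            for (Integer id : page) {
                order.remove(id); // remove(Object): drop the old position
                order.add(id);    // the "re-indexed" doc reappears at the end
                updates++;
                updated.add(id);
            }
        }
        return new int[] {updates, updated.size()};
    }

    public static void main(String[] args) {
        int[] r = simulate(500, 50);
        System.out.println("updates issued: " + r[0] + ", distinct docs updated: " + r[1]);
    }
}
```

With 500 docs and pages of 50, this model issues 500 updates but touches only 250 distinct docs. The exact numbers differ from the 397, 300 and 350 I observed, and the model does not explain the run-to-run variation, but it shows the same pattern of some docs updated twice and others never.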
Tomislav