Hi,
I have an additional question regarding updates for a given query. We
are actually trying to build a concurrent/multithreaded update solution
where any user can update any result set in a bulk update (for any
given query). When scroll is used (with sorting applied) and two scroll
updater threads try to operate on common/mutual documents, one thread
doesn't update all documents.
I have a question regarding the proposed scenario (from your response):
"... start from the back and move forward (while updating the timestamp
as well). Note that you will need to refresh after each bulk indexing to
"see" the latest updates"
Won't this approach also have problems in a multithreaded/multiuser
environment where multiple users can issue concurrent update commands
on mutual/common documents? For example: one update thread updates a
document's timestamp, and another update thread then doesn't consider
that document for its own (different) update, because the document has
recently been touched.
What would be the best/recommended approach for large concurrent
updates (any good ideas :)?
Tomislav
On Wed, 2010-11-24 at 12:53 +0200, Shay Banon wrote:
Let me try and explain again: when you do the to/from walking with a
query, you might actually get back docs that you already updated and
update them again. You can either filter out the docs you want based on
the updated value, or use timestamps as I suggested before to make sure
you only update the docs you want.

On Wed, Nov 24, 2010 at 10:20 AM, Tomislav Poljak <tpoljak@gmail.com>
wrote:
Hi,
this is not full re-indexing, we only need to be able to update all
documents matched by some/arbitrary "field value" query.

For example, update all docs where user is kimchy -> if
queryString("user:kimchy") is used instead of *:* in the bug
reconstruction example, the update still doesn't update all matching
documents:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":370,"_shards":{"total":5,"successful":5,"failed":0}}

Can you please advise what is currently the recommended/working way to
update all docs where 'user' is 'kimchy'?

Tomislav

On Tue, 2010-11-23 at 23:35 +0200, Shay Banon wrote:

One more option, if you are doing a full reindexing, is to reindex into
a fresh index. This will be much faster since there won't be any need
to handle deletes and expunge them later on from the index.

On Tue, Nov 23, 2010 at 11:34 PM, Shay Banon
<shay.banon@elasticsearch.com> wrote:

Hi,

The problem you have is the fact that there is no ordering guaranteed
when doing a match-all query. What you would want to do is introduce
some sort of ordering (a timestamp, for example). Then you have two
options; either start from the back and move forward (while updating
the timestamp as well). Note that you will need to refresh after each
bulk indexing to "see" the latest updates.

-shay.banon

On Tue, Nov 23, 2010 at 6:05 PM, Tomislav Poljak <tpoljak@gmail.com>
wrote:

Hi,

since scrolling is still broken (in all latest versions, 0.13 included)

https://github.com/elasticsearch/elasticsearch/issues#issue/136

the 'from' parameter work-around seems to be the recommended way to go:

http://elasticsearch-users.115913.n3.nabble.com/totalHits-gets-changed-unexpectedly-while-scrolling-SearchResponse-td1575408.html

But it seems (to me) that the 'from' parameter work-around has
similar/the same problems when updating documents matched over
multiple/several shards.

Here is the bug reconstruction:
1. Index 500 docs (default index settings, 5 shards):

for (int i = 0; i < 500; i++) {
    IndexResponse response = client.prepareIndex("twitter", "tweet", Integer.toString(i))
            .setSource(jsonBuilder()
                    .startObject()
                    .field("user", "kimchy")
                    .field("postDate", new Date())
                    .field("message", "trying out Elastic Search")
                    .endObject())
            .execute()
            .actionGet();
}

2. Check the number of docs in the index for the query *:*

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=*:*'
{"count":500,"_shards":{"total":5,"successful":5,"failed":0}}

3. Update all docs (docs matched by *:*) in 50-doc chunks:

for (int from = 0; from < 500; from += 50) {
    SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();

    for (SearchHit searchHit : response.hits().hits()) {
        Map<String, Object> map = searchHit.sourceAsMap();
        map.put("message", "updated");
        client.index(indexRequest("twitter").type("tweet")
                .id(searchHit.getId()).source(map)).actionGet();
    }
}

4. Check the number of updated docs:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=message:updated'
{"count":397,"_shards":{"total":5,"successful":5,"failed":0}}

As you can see, only 397 documents got updated. It seems 500 updates do
occur, but some docs are matched and updated twice and others never get
updated.
I've added debug code to the update:

Set<String> updatedDocs = new HashSet<String>();

for (int from = 0; from < 500; from += 50) {
    SearchResponse response = client.prepareSearch("twitter")
            .setTypes("tweet")
            .setSearchType(SearchType.QUERY_THEN_FETCH)
            .setQuery(queryString("*:*"))
            .setFrom(from)
            .setSize(50)
            .execute()
            .actionGet();

    for (SearchHit searchHit : response.hits().hits()) {
        Map<String, Object> map = searchHit.sourceAsMap();
        map.put("message", "updated");
        client.index(indexRequest("twitter").type("tweet")
                .id(searchHit.getId()).source(map)).actionGet();

        // debug
        if (updatedDocs.contains(searchHit.getId())) {
            System.out.println("Already updated doc, ID: " + searchHit.getId());
        } else {
            updatedDocs.add(searchHit.getId());
        }
    }
}

and got duplicates on the output:

Already updated doc, ID: 405
Already updated doc, ID: 406
Already updated doc, ID: 414
Already updated doc, ID: 413
Already updated doc, ID: 412
Already updated doc, ID: 419

Also, one interesting thing is that the number of actually updated
documents seems to change from run to run for the same code. For the
example above: the first time only 300 got updated, the second time 397
were updated, and on the last try 350 (I deleted local data/* and
re-indexed between update tests).

Tomislav
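The duplicate/missed pattern above is consistent with the from/size window sliding over a result set whose order changes as documents are reindexed. Below is a small self-contained simulation in plain Java (no Elasticsearch; a single "shard" is modeled as a list, and an update is modeled as delete + re-append, roughly like Lucene's delete-and-re-add, so this is an illustrative assumption rather than the engine's exact behavior). Paging with from/size over that mutating order both skips docs and re-visits docs:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FromSizeSimulation {
    public static void main(String[] args) {
        // One "shard": list position plays the role of the internal doc order.
        List<Integer> index = new ArrayList<Integer>();
        for (int i = 0; i < 500; i++) index.add(i);

        Set<Integer> updated = new LinkedHashSet<Integer>();
        int duplicates = 0;

        for (int from = 0; from < 500; from += 50) {
            // "Search": snapshot the current page of 50 docs in index order.
            List<Integer> page = new ArrayList<Integer>(
                    index.subList(from, Math.min(from + 50, index.size())));
            for (Integer id : page) {
                if (!updated.add(id)) duplicates++;
                // "Update": delete the doc and re-add it, which moves it to
                // the end of the order and shifts the docs behind it forward.
                index.remove(id);
                index.add(id);
            }
        }

        int missed = 500 - updated.size();
        System.out.println("unique=" + updated.size()
                + " duplicates=" + duplicates + " missed=" + missed);
        System.out.println(missed > 0 && duplicates > 0);
    }
}
```

The exact counts differ from the real five-shard run (397 of 500), but the mechanism matches the symptoms: docs shifted to positions below the current 'from' offset are never visited, while re-added docs show up again in later pages.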