Hi everyone, I'm trying to optimize update speed on my ES cluster and want to understand in more depth how its update and update_by_query work. So I have some questions:
How does update in Elasticsearch really work? As far as I know, when I update one doc, it needs some time to refresh before I can see the result. But if I send another update request for that doc before the refresh happens, will it skip the first update and start from the later update request?
I have read the ES docs on update_by_query and understand that after it runs a search for the query, it sends bulk requests with the docs the query matched:
While processing an update by query request, Elasticsearch performs multiple search requests
sequentially to find all of the matching documents. A bulk update request is performed for each
batch of matching documents. Any query or update failures cause the update by query request to
fail and the failures are shown in the response. Any update requests that completed successfully
still stick, they are not rolled back.
So if I already know exactly which docs to update (no query is needed to find them), which is faster: bulk or update by query? Link to my old question: how to not let ES high load.
Combining the two questions: if I send a first update_by_query request and then immediately send a second update_by_query request (with no time for the docs to refresh), will it work? And should I use a second update_by_query request instead of bulk?
If the second update makes it to the shard and is applied, it will still work, yes.
However, if you are talking about running multiple updates and worrying about the timing of refreshes, you might want to rethink your use of Elasticsearch for your problem, as this approach seems seriously inefficient.
Hi @warkolm, thanks for your reply.
Hmm, so it processes requests in FIFO order, right? In my case there are two requests to the same doc/shard. Once ES has returned a response for the first request, that request is complete, right? And can I then immediately send the second request?
When sending bulk update requests you specify the ids of the documents to be updated. This will always update the latest version of the document, even if it is not yet searchable as a refresh has not taken place.
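To make the ID-based approach concrete, here is a minimal sketch of what a bulk update body looks like. It only builds the newline-delimited JSON that the `_bulk` endpoint expects and does not send anything. The index name `my-index`, the `status` field, and the helper name `build_bulk_update_body` are made up for illustration.

```python
import json


def build_bulk_update_body(index, partial_docs):
    """Build an NDJSON body for the _bulk API from {doc_id: partial_doc}."""
    lines = []
    for doc_id, partial in partial_docs.items():
        # Action line: the document is addressed directly by its ID.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Payload line: fields to merge into the existing document.
        lines.append(json.dumps({"doc": partial}))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical index name, IDs, and field values.
body = build_bulk_update_body(
    "my-index",
    {"1": {"status": "done"}, "2": {"status": "failed"}},
)
print(body)
```

Because each action names its document ID, no search step is involved, which is why this path does not depend on a refresh having happened.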
When you use update by query, a query is run first and the update is then performed based on its results. If an earlier change has not been refreshed and made searchable by the time you start the query, you will likely see a version conflict.
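The difference between the two paths can be illustrated with a small toy model. This is not how Elasticsearch is implemented, just a sketch under simplifying assumptions: one shard, a `live` view that ID-based updates read, and a `searchable` view that only changes on refresh. The class `ToyShard` and all of its methods are invented for this example.

```python
class ToyShard:
    """Toy model of one shard: live documents plus a searchable snapshot."""

    def __init__(self):
        self.live = {}        # doc_id -> (version, source); always current
        self.searchable = {}  # what searches see; only changes on refresh()

    def index(self, doc_id, source):
        version = self.live.get(doc_id, (0, None))[0] + 1
        self.live[doc_id] = (version, dict(source))

    def refresh(self):
        # Make everything written so far visible to searches.
        self.searchable = dict(self.live)

    def update_by_id(self, doc_id, changes):
        # ID-based updates read the live document, refreshed or not.
        version, source = self.live[doc_id]
        self.live[doc_id] = (version + 1, {**source, **changes})

    def update_by_query(self, predicate, changes):
        # Query-based updates start from the searchable snapshot ...
        hits = {i: v for i, v in self.searchable.items() if predicate(v[1])}
        updated, conflicts = [], []
        for doc_id, (seen_version, _) in hits.items():
            # ... and are rejected if the live version has moved on since.
            if self.live[doc_id][0] != seen_version:
                conflicts.append(doc_id)
            else:
                self.update_by_id(doc_id, changes)
                updated.append(doc_id)
        return updated, conflicts


def is_new(source):
    return source["status"] == "new"


shard = ToyShard()
shard.index("1", {"status": "new"})
shard.refresh()

# Two ID-based updates in a row, with no refresh in between: both apply.
shard.update_by_id("1", {"owner": "a"})
shard.update_by_id("1", {"owner": "b"})

# Update by query before a refresh sees the stale version and conflicts.
stale = shard.update_by_query(is_new, {"status": "done"})

# After a refresh, the same request goes through.
shard.refresh()
fresh = shard.update_by_query(is_new, {"status": "done"})

print(stale, fresh)
```

In this model the back-to-back ID-based updates both succeed, while the query-based update only succeeds once the earlier changes have been refreshed.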
So in conclusion, I need to refresh after the first request, right? The problem is that if I set refresh = true, my Java client API throws a timeout error. How can I handle that? And what about speed and performance of bulk versus update_by_query? I thought that bulk needs to search for each element and then update it, while update_by_query can search for all the elements first and then update them all. Is that right? Thanks.
They both need to retrieve the document and then update it so there is no difference there. Update by query has to run a query while the bulk request has direct access to the IDs, so that is probably where the difference lies. I would not expect any major performance difference between the approaches, but it is probably best if you test it.
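For comparison with an ID-based bulk request, here is a rough sketch of the shape of an update by query request: it carries a query and a script rather than a list of IDs. The code only builds the method, path, and JSON body and sends nothing. The helper name, the index `my-index`, and the `status` field are hypothetical, and whether `conflicts=proceed` or `refresh=true` is appropriate depends on your use case.

```python
import json
from urllib.parse import urlencode


def build_update_by_query_request(index, query, script_source, params=None, **url_params):
    """Return (method, path, body) for an _update_by_query call."""
    path = f"/{index}/_update_by_query"
    if url_params:
        # Optional URL parameters such as conflicts or refresh.
        path += "?" + urlencode(url_params)
    body = {
        # The query selects the documents; no IDs are listed.
        "query": query,
        # The script is applied to every matching document.
        "script": {
            "lang": "painless",
            "source": script_source,
            "params": params or {},
        },
    }
    return "POST", path, json.dumps(body)


# Hypothetical index, field, and values.
method, path, body = build_update_by_query_request(
    "my-index",
    {"term": {"status": "new"}},
    "ctx._source.status = params.status",
    {"status": "done"},
    conflicts="proceed",
    refresh="true",
)
print(method, path)
print(body)
```

Timing both request shapes against a copy of your own data is the practical way to follow the advice above.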