For example, my thread pool size is 12, so 12 threads run at once.
Each thread sends 2,000 actions per request.
If all 12 threads try to update the same document concurrently,
then in the worst case the number of conflicts would be as follows:
12 * 2,000 - 1 = 23,999
(thread count * number of documents per thread) - 1 (excluding myself)
Is that the right answer?
Question 1.
What is the correct value for retry_on_conflict?
How should I choose the right value for retry_on_conflict?
Question 2.
When a conflict occurs, is the update guaranteed to be performed only once?
Question 3.
Should I add the "refresh=true" param to each document?
What difference does it make?
Question 4.
Is there a limit on the retry_on_conflict param value?
Is there a performance issue when I add it to bulk actions?
Thank you for reading my post.
If you know the answer, please feel free to tell me.
Elasticsearch cannot know what a useful retry_on_conflict count in your application is, as it depends on what your application is actually changing (incrementing a counter is easier than replacing fields with concurrent updates).
The first question you should ask yourself is whether you need this at all, or whether your indexing infrastructure already ensures that you only index in a serialized manner. If you do need parallel indexing of similar documents, think about the worst-case outcomes. If you have components that each change different parts of the documents (one updates Facebook info, another Twitter info) and only one instance of each updater can run at a time, then you can use a small number (the number of updaters plus some headroom).
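To make the "number of updaters plus some headroom" rule of thumb concrete, here is a small toy simulation. It is only a sketch under stated assumptions: it does not talk to Elasticsearch, the function name `worst_case_conflicts` is made up for illustration, and it assumes every updater issues exactly one update for the same document while nothing else writes to it. It plays out the most adversarial interleaving, where all pending updaters read the same version and only one write wins per round.

```python
def worst_case_conflicts(num_updaters: int) -> int:
    """Toy model of optimistic concurrency on ONE document.

    Assumption (not Elasticsearch internals): in every round all pending
    updaters read the same version, exactly one write succeeds, and the
    others hit a version conflict and have to retry.
    Returns the largest number of conflicts any single updater sees.
    """
    version = 1
    conflicts = {u: 0 for u in range(num_updaters)}
    pending = list(range(num_updaters))

    while pending:
        # Everyone reads the current version before anyone writes.
        seen = {u: version for u in pending}
        still_pending = []
        for u in pending:
            if seen[u] == version:
                # Version unchanged since the read: the write succeeds.
                version += 1
            else:
                # Someone else wrote first: conflict, retry next round.
                conflicts[u] += 1
                still_pending.append(u)
        pending = still_pending

    return max(conflicts.values(), default=0)


if __name__ == "__main__":
    print(worst_case_conflicts(12))
```

In this model the unluckiest of 12 updaters needs 11 retries, so the worst case per document scales with the number of concurrent writers to that document, not with the total number of actions in flight.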
Q2: When a conflict occurs, the Update API stops after a single invocation because of its optimistic concurrency control; see https://www.elastic.co/guide/en/elasticsearch/guide/current/optimistic-concurrency-control.html
Q3: No. The Get API is used, which does not require a refresh.
Q4: Not sure what you mean by limitation here. Performance will be different, because you are retrying with another index operation instead of stopping after the first one. So the higher the value is set, the more additional (and potentially failed) index operations might be performed per document.
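For reference, this is roughly how retry_on_conflict can be attached to each update action in a bulk request body. Treat it as a sketch: the helper name `build_bulk_update_body`, the index name, the document IDs, and the value 3 are all placeholders, and it only builds the newline-delimited JSON without sending it. It assumes a recent Elasticsearch version where the key in the action line is `retry_on_conflict` (older releases used `_retry_on_conflict`), so please check the Bulk API docs for your version.

```python
import json


def build_bulk_update_body(index, docs, retry_on_conflict=3):
    """Build an NDJSON bulk body of partial-document updates.

    `docs` maps document ID -> partial document to merge in.
    Each update is two lines: the action metadata, then the payload.
    """
    lines = []
    for doc_id, partial_doc in docs.items():
        action = {
            "update": {
                "_index": index,
                "_id": doc_id,
                # Per-action retry budget for version conflicts.
                "retry_on_conflict": retry_on_conflict,
            }
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_update_body(
        "my-index",  # hypothetical index name
        {"1": {"facebook": "alice"}, "2": {"twitter": "@bob"}},
    )
    print(body, end="")
```

Because the retry budget is set per action, a higher value only costs extra work for the documents that actually run into conflicts.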
Additional question:
Finally, I would like your opinion on whether using the retry_on_conflict param is the right approach.
In my opinion, based on the link below,
using retry_on_conflict is the right approach under a parallel concurrency model,
even with 20 threads and 2,000 documents per thread.
Again, it depends on your use case and how you use scripts. If you increment a counter, then the order of the increments might not matter to you, so a higher retry_on_conflict value is fine. The same applies if you have concurrent updates to different parts of the document and you just want to make sure that all the updates are written.
However, if you overwrite fields and simply replace those values, then you might need to go back to your own application and let that application decide how to handle this. Maybe you can merge the data that has been written with the data that you want to write, maybe overwriting is ok.
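If you go the route of letting your application decide, the usual pattern is a read, merge, conditional write loop. The sketch below shows the idea against a tiny in-memory store, not against Elasticsearch: `InMemoryStore`, `VersionConflict`, `put_if_version`, and `update_with_merge` are all hypothetical names made up for this example, and the merge function is whatever rule fits your data.

```python
class VersionConflict(Exception):
    """Raised when the stored version is not the one the writer expected."""


class InMemoryStore:
    """Hypothetical stand-in for a versioned document store."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        version, source = self._docs.get(doc_id, (0, {}))
        return version, dict(source)

    def put_if_version(self, doc_id, source, expected_version):
        current_version, _ = self._docs.get(doc_id, (0, {}))
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (current_version + 1, dict(source))


def update_with_merge(store, doc_id, changes, merge, max_retries=3):
    """Read the document, merge in `changes`, write only if unchanged.

    On a conflict the document is re-read and merged again, so the
    application (via `merge`) decides what happens to concurrent writes.
    """
    for _ in range(max_retries + 1):
        version, current = store.get(doc_id)
        merged = merge(current, changes)
        try:
            store.put_if_version(doc_id, merged, version)
            return merged
        except VersionConflict:
            continue  # someone else wrote in between; try again
    raise VersionConflict(f"gave up on {doc_id} after {max_retries} retries")


if __name__ == "__main__":
    store = InMemoryStore()

    def keep_both(current, changes):
        # Example rule: new fields win, untouched fields survive.
        return {**current, **changes}

    update_with_merge(store, "1", {"facebook": "alice"}, keep_both)
    update_with_merge(store, "1", {"twitter": "@alice"}, keep_both)
    print(store.get("1"))
```

The difference from a blind retry is that every attempt re-reads the latest document, so a field written by another updater is merged rather than silently overwritten.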
For many cases, the Update API plus retry_on_conflict is a good solution; for some it's a no-go, and that's how you evaluate whether you want to use it or not.
Hope this helps, even though it is not a definite answer.