Preface -
The cluster is running version 6.8 and we are doing a mix of search/create/update operations using the NodeJS client.
Operations can access the same document in quick succession or concurrently, since they are driven by events coming from Kafka.
All actions are sent with refresh:true since we want to be in sync as much as possible,
and update requests are sent with retry_on_conflict set to a fairly high number (5).
The issue -
No matter how high we set the retry_on_conflict number we still get version mismatch exceptions.
What I don't understand is where the seqNo is coming from: when looking at the document we get from searching, which we then use for the update, there is no version/seqNo/primaryTerm in it.
Since we cannot reduce the concurrency of event handling per document, the current idea is to re-fetch the document and then redo the update logic, but I don't see how that would change anything if the seqNo/primaryTerm/version is not there anyway.
What is the correct way to handle these kinds of use cases? On the surface, since we force a refresh, it should behave as close to a synchronous database as possible. Additionally, I expected the retry_on_conflict parameter to solve the issue completely.
So if you do not set the correct values that are returned from the previous index operation, you will always get a mismatch.
Optimistic concurrency control
Index operations can be made conditional and only be performed if the last modification to the document was assigned the sequence number and primary term specified by the if_seq_no and if_primary_term parameters. If a mismatch is detected, the operation will result in a VersionConflictException and a status code of 409. See Optimistic concurrency control for more details.
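To make the quoted behaviour concrete, here is a minimal sketch of a conditional write. It is a toy in-memory model, not the Elasticsearch client: `ToyIndex`, `VersionConflictError`, `ifSeqNo` and `ifPrimaryTerm` are hypothetical names that only mirror the if_seq_no / if_primary_term parameters and the 409 status described above.

```javascript
// Toy in-memory model of optimistic concurrency control.
// Hypothetical names throughout; this is NOT the Elasticsearch API.
class VersionConflictError extends Error {
  constructor(message) {
    super(message);
    this.name = "VersionConflictError";
    this.statusCode = 409; // mirrors the 409 in the docs quote
  }
}

class ToyIndex {
  constructor() {
    this.docs = new Map(); // id -> { source, seqNo, primaryTerm }
    this.nextSeqNo = 0;
    this.primaryTerm = 1;
  }

  // Returns a copy of the stored document together with its seqNo/primaryTerm.
  get(id) {
    const doc = this.docs.get(id);
    return doc ? { ...doc, source: { ...doc.source } } : undefined;
  }

  // cond is optional: { ifSeqNo, ifPrimaryTerm }.
  // Without cond the write is unconditional (last writer wins).
  index(id, source, cond) {
    const current = this.docs.get(id);
    if (cond !== undefined) {
      const matches =
        current !== undefined &&
        current.seqNo === cond.ifSeqNo &&
        current.primaryTerm === cond.ifPrimaryTerm;
      if (!matches) {
        throw new VersionConflictError(`conflict on document ${id}`);
      }
    }
    const stored = {
      source: { ...source },
      seqNo: this.nextSeqNo++,
      primaryTerm: this.primaryTerm,
    };
    this.docs.set(id, stored);
    return { seqNo: stored.seqNo, primaryTerm: stored.primaryTerm };
  }
}

// Two writers read the same snapshot; only the first conditional write succeeds.
const index = new ToyIndex();
index.index("doc-1", { count: 1 });
const seen = index.get("doc-1");
const cond = { ifSeqNo: seen.seqNo, ifPrimaryTerm: seen.primaryTerm };
index.index("doc-1", { count: 2 }, cond); // succeeds, seqNo advances
try {
  index.index("doc-1", { count: 3 }, cond); // stale snapshot
} catch (err) {
  console.log(err.name, err.statusCode);
}
```

The point of the sketch is that the condition is only meaningful if the writer carries the values it read; a write with no condition is never rejected.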
In newer versions there are some additional options such as wait_for that can help with these things
1. True, 6.8 is indeed ancient, but I am not aware of any security/breaking issues that are fixed between the latest 6.8 and 7.17. An upgrade is planned anyway.
2. In the case that I don't fetch seqNo/primaryTerm, what is the function of retry_on_conflict? It is server-side logic as far as I can tell, but I see no way for Elasticsearch to know the "correct" seqNo/primaryTerm from my request.
3. wait_for exists in 6.8 as well, and from the docs it looks like a "weaker" version of true. Is conflict resolution different when using wait_for?
For example, say we have 3 update requests for the same document: using wait_for or true seems to lead to the same outcome, where a conflict will occur and won't be resolved by ES.
Which API generates the conflict, and how are you calling it: _bulk or update?
Also, you can get the _seq_no with any GET by _id call if you want to build your own logic and resubmit.
Per the docs, update is a two-phase operation (GET, then index), which would explain why you are seeing the conflict. The GET retrieves the _seq_no and primary term, and by the time it tries to index the doc they have changed; that is how I read it.
retry_on_conflict In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.
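The get-then-index behaviour in that quote can be sketched with a small simulation. This is a toy model under stated assumptions, not server code: `makeStore`, `conditionalWrite`, `updateWithRetry` and the `interfere` hook (which plays the part of another writer sneaking in between the two phases) are all hypothetical names. It assumes retry_on_conflict counts retries, so a value of 5 allows 6 attempts in total.

```javascript
// Toy simulation of an update as "get, then conditional index, retry on conflict".
// Hypothetical names; this only models the behaviour described in the docs quote.
function makeStore() {
  return { doc: { count: 0 }, seqNo: 0 };
}

// Writes only if nobody else has written since expectedSeqNo was read.
function conditionalWrite(store, doc, expectedSeqNo) {
  if (store.seqNo !== expectedSeqNo) {
    const err = new Error("version conflict");
    err.statusCode = 409;
    throw err;
  }
  store.doc = doc;
  store.seqNo += 1;
}

// interfere(attempt) runs between the get and index phases to simulate a concurrent writer.
// Returns the number of retries that were needed.
function updateWithRetry(store, mutate, retryOnConflict, interfere) {
  for (let attempt = 0; ; attempt++) {
    // Get phase: the seqNo is read here, which is why the caller never supplies it.
    const snapshot = { doc: { ...store.doc }, seqNo: store.seqNo };
    if (interfere) interfere(attempt);
    try {
      // Index phase: fails if the document changed since the get phase.
      conditionalWrite(store, mutate(snapshot.doc), snapshot.seqNo);
      return attempt;
    } catch (err) {
      if (err.statusCode !== 409 || attempt >= retryOnConflict) throw err;
    }
  }
}

// A competing writer that wins the race on the first two attempts only.
const store = makeStore();
const competing = (attempt) => {
  if (attempt < 2) {
    conditionalWrite(store, { count: store.doc.count + 10 }, store.seqNo);
  }
};
const retriesUsed = updateWithRetry(store, (d) => ({ count: d.count + 1 }), 5, competing);
console.log(retriesUsed, store.doc.count);
```

If the competing writer wins on every attempt, the loop runs out of retries and the conflict surfaces to the caller, however high the retry count is set. That is consistent with sustained contention on a single document producing errors even with a high retry_on_conflict.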
Perhaps you are running into something else, but that would seem to be a pretty good candidate for the explanation...
Curious what rate you are calling refresh...
Perhaps a colleague of mine might have a comment @DavidTurner any insight?
1. We don't have any bulk calls, so all the exceptions are thrown on update operations (partial doc and script).
2. The rate of refresh would be about 100/s for updates and 500/s for index requests, if I can trust the metrics, that is.
3. It's not that I want to build my own logic; it's just that the built-in solution fails for me in a fairly high percentage of cases. My assumption here is that multiple events come in at once from Kafka, causing the logic to send multiple update requests at the same time. When they conflict, they retry at the same time, so some of them fail even if the retry count is high.
If I were to build my own logic, I would use a random delay between retries so things sort themselves out, hopefully.
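A minimal sketch of that idea, a client-side retry with a random delay, might look like the following. All names (`isVersionConflict`, `jitteredDelayMs`, `retryOnConflictWithJitter`) and the default numbers are hypothetical. It assumes the client surfaces a conflict as an error carrying a 409 in `statusCode` or `status`, which should be checked against the client version actually in use. The `operation` callback is expected to re-fetch the document and redo the update logic on every attempt.

```javascript
// Client-side retry with randomised, capped exponential delay.
// Hypothetical helper names; the error shape is an assumption about the client.
function isVersionConflict(err) {
  return Boolean(err) && (err.statusCode === 409 || err.status === 409);
}

// Random delay in [0, min(capMs, baseMs * 2^attempt)), rounded down to whole ms.
function jitteredDelayMs(attempt, baseMs, capMs, random = Math.random) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Calls operation(attempt) until it succeeds, a non-conflict error occurs,
// or maxRetries conflicts have been retried.
async function retryOnConflictWithJitter(operation, options = {}) {
  const {
    maxRetries = 5,
    baseMs = 20,
    capMs = 1000,
    random = Math.random,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      if (!isVersionConflict(err) || attempt >= maxRetries) throw err;
      await sleep(jitteredDelayMs(attempt, baseMs, capMs, random));
    }
  }
}
```

Because each caller draws its own delay, requests that collided once are unlikely to collide again on the same schedule, which is the "things sort themselves out" effect described above. It does not remove contention, so a retry limit is still needed.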
Doesn't a refresh rate of 100s for updates mean that any subsequent update of the same doc within 100s will result in a version conflict? How often do you write? I assume it's faster than once per 100s.
Sorry, it's not 100 seconds; the refresh interval is the default (which should be 1 second? 30 seconds? unsure).
I meant there are 100 refreshes per second for updates and 500 per second for index requests.
I have nothing to add, I think you covered everything. I don't remember what was or wasn't available in 6.8, it's too old, but what you say sounds reasonable for all versions that aren't past EOL.