In our system, events arrive in a stream. In most cases, we just need to index (add) the events to new or existing documents (upsert operation); however, there is one event for which we need to update existing documents (update operation). For example: we get constant change-sets of "new order arrived", but after a while we might get an event of "order #3123 has been cancelled", which means we need to find the old order document and mark it as cancelled.
When using the bulk API, can Elasticsearch guarantee that all the upsert operations will be applied BEFORE the update operations (on the same document)?
I have tried to look for an answer to that in the forum, and I was confused. Probably I'm the one who is not getting something. In one thread, @jpountz said that the order for the same document is preserved. In another thread, @nik9000 and @jasontedor said that "if two concurrent requests touch the same document then it isn't really clear which one will come second unless the requests use versioning" and that "Elasticsearch is distributed and concurrent. We do not guarantee that requests are executed in the order they are received".
Anyway, the simplest and safest solution, as far as we understand, is to delay the update operations and stream them to our system after something like 24 hours. Another, more complex solution might be making sure that no more than 2 requests referencing the same document id are found in the same bulk, but it would complicate our applications.
- Our document model is as follows: each document consists of 3 different, separate nested fields.
- We have 3 servers running all the time, one for each nested document, and each of them produces constant upsert operations.
- Obviously, we get a "VersionConflictEngineException" around 3-4 times a day. Our code just retries the request. In the future, we will simply configure "retry_on_conflict: 3".
- We're using Elasticsearch 5.3.
- We're talking about one index, one type, and one id which is relevant for each upsert or update operation.
- We don't use external versions (i.e. we don't send a version id ourselves).
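To make the setup above concrete, here is a minimal sketch of how a bulk body that first upserts and then updates the same document could be assembled. It only builds the newline-delimited JSON and does not talk to a cluster. The index name `orders`, the type `order`, the field `status` and the helper names are hypothetical. The per-action `_retry_on_conflict` key and the `inline` script key are what I believe the 5.x bulk and update APIs expect, so treat both as assumptions to check against the documentation for your exact version.

```python
import json


def update_action(index, doc_type, doc_id, body, retries=3):
    """Return the two lines (metadata + body) of one bulk 'update' action."""
    meta = {
        "update": {
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            # Assumed 5.x spelling of the per-action retry setting.
            "_retry_on_conflict": retries,
        }
    }
    return [meta, body]


def to_ndjson(lines):
    """Serialize to newline-delimited JSON; the body must end with a newline."""
    return "".join(json.dumps(line) + "\n" for line in lines)


ops = []
# 1) Upsert: create the document if it is missing, otherwise merge the fields.
ops += update_action(
    "orders", "order", "3123",
    {"doc": {"status": "new"}, "doc_as_upsert": True},
)
# 2) Scripted update of the same id, placed AFTER the upsert in the same bulk.
ops += update_action(
    "orders", "order", "3123",
    {
        "script": {
            "lang": "painless",
            "inline": "ctx._source.status = params.status",
            "params": {"status": "cancelled"},
        }
    },
)

body = to_ndjson(ops)
print(body)
```

The resulting string is what would be sent as the body of a `_bulk` request; the position of each action in it is the only ordering information the request carries.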
Your help in clarifying this matter will be much appreciated!
Sorry for the long description.
It's important to note that on the one hand @jpountz and on the other hand @nik9000 and myself are talking about different things. The comments by @jpountz are in the context of a single request where we guarantee ordering. The comments by @nik9000 and myself are in the context of multiple concurrent requests where we do not guarantee any ordering.
Does this answer your question?
Thank you for your quick reply!
I'm adding a demonstration in order to make sure I got it:
The product document consists of 3 nested fields.
There are 3 active servers, and each of them sends one bulk at a time (bulk #4 doesn't really exist yet).
I understand that there is no guarantee of any order between Bulk #1, Bulk #2 and Bulk #3. It means that if I move the orange requests to a 4th bulk, they might occur before Bulk #3, which is dangerous unless I apply the 24-hour delay to the requests in this bulk.
However, the order of the requests within each bulk is guaranteed. This means that if the bulk consists of adding and then updating the same nested documents, everything is safe. Is the order determined by priority (upserts occur before updates?), or by the order of the requests in the bulk? And what happens if some requests in the bulk get some kind of error (such as a version conflict)? Will the later requests in the bulk wait until the previous requests are resolved? Is this behaviour the same across all Elasticsearch versions?
It's done by the order of the requests in a single bulk call. However, if bulk 4 happens to be processed before bulk 3, then Elasticsearch will accept that.
You will receive an error back.
Nope, it will error.
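As an illustration of how such errors surface, here is a small sketch that scans a bulk response for failed items. A bulk response carries a top-level `errors` flag and an `items` list in the same order as the actions in the request, and each failed item holds its own `status` and `error`. The helper name `failed_items` and the sample response are made up for this example; the sample is hand-written, not captured from a real cluster.

```python
def failed_items(bulk_response):
    """Return (position, action, status, error_type) for every failed item."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for position, item in enumerate(bulk_response["items"]):
        # Each item is a single-key dict: {"update": {...}}, {"index": {...}}, ...
        action, result = next(iter(item.items()))
        if "error" in result:
            failures.append(
                (position, action, result["status"], result["error"]["type"])
            )
    return failures


# Hand-written sample: the first action succeeded, the second hit a conflict.
sample_response = {
    "took": 5,
    "errors": True,
    "items": [
        {"update": {"_id": "3123", "status": 201}},
        {
            "update": {
                "_id": "3123",
                "status": 409,
                "error": {
                    "type": "version_conflict_engine_exception",
                    "reason": "sample reason text",
                },
            }
        },
    ],
}

for failure in failed_items(sample_response):
    print(failure)
```

Because each item reports its own result, the caller can retry or log only the positions that failed instead of resending the whole bulk.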
Okay, it sure makes sense now.
Eventually, we chose to separate the update requests from the 3rd bulk into a 4th "delayed" bulk. Even though, according to this thread, it's possible to make insert and update requests in the same bulk, in our case each update request runs a Painless script, which might be more fragile and slower than a regular insert/upsert request, so we wanted to isolate them from the others. In addition, if we have to scale out and run more than a single bulk at once for adding new documents of a certain nested field, we can (the rate of adding documents is much higher than the rate of updating them).
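A rough sketch of that separation, under assumptions: each incoming event is a dict with a hypothetical `kind` field (`"upsert"` or `"update"`), and the delay is the 24 hours mentioned earlier in the thread. The names `route`, `DELAY` and `send_after` are invented for the example and are not part of any Elasticsearch API.

```python
from datetime import datetime, timedelta

# Assumed delay, taken from the "something like 24 hours" idea in this thread.
DELAY = timedelta(hours=24)


def route(events, now):
    """Split events into an immediate batch and a delayed batch.

    The relative order of events is preserved inside each batch.
    """
    immediate, delayed = [], []
    for event in events:
        if event["kind"] == "update":
            # Scripted updates wait, so the upserts they depend on land first.
            delayed.append({**event, "send_after": now + DELAY})
        else:
            immediate.append(event)
    return immediate, delayed


events = [
    {"kind": "upsert", "id": "3123", "doc": {"status": "new"}},
    {"kind": "update", "id": "3123", "params": {"status": "cancelled"}},
    {"kind": "upsert", "id": "3124", "doc": {"status": "new"}},
]

immediate, delayed = route(events, datetime(2017, 6, 1, 12, 0))
print(len(immediate), len(delayed))
```

Keeping the two batches apart also lets the upsert side be scaled out independently, since only the delayed batch depends on earlier writes having been applied.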
Thank you for your references!