Hello everyone,
####Background:
We're using the bulk API with upsert, without the `version` or `retry_on_conflict` parameters.
The architecture is as follows: 2 servers, each handling different event types, constantly receive data and write it in bulks to the same index. The upsert script handles the situation where we're adding nested documents to root documents which don't yet exist in the index. In addition, when a root document does exist, the script might decide to overwrite (replace) some of its data.
####The problem:
Lately we have been receiving more data, and this causes version conflicts about 5-6 times a day. The exception that appears in the log looks something like this:
"Version conflict, current [4], provided [3]".
####Possible explanations to the problem, as far as I can see:
1. The servers might simultaneously try to write to the same documents.
2. Each server works in a parallel, multi-threaded environment, meaning that several bulks are sent to Elasticsearch at once and the conflicts are between bulks sent from the same server.
3. The bulk itself contains operations which relate to the same documents. I believe that's the root cause of the problem. Why? Because right now the code tries to handle this exception by re-sending the failed operations in another bulk, but it doesn't help. If the root cause were 1 or 2 above, this behavior would have solved the problem.
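
One way to check the third explanation is to look for document IDs that occur more than once within a single bulk before it is sent. Here is a minimal Python sketch, assuming each operation is held in memory as a dict with an `_id` key (a hypothetical shape for illustration, not the bulk wire format):

```python
from collections import Counter


def duplicated_ids(operations):
    """Return the set of document IDs that appear more than once in one bulk."""
    counts = Counter(op["_id"] for op in operations)
    return {doc_id for doc_id, n in counts.items() if n > 1}


# Hypothetical bulk: two operations target the same root document "a".
bulk = [
    {"_id": "a", "event": "add_nested"},
    {"_id": "b", "event": "add_nested"},
    {"_id": "a", "event": "overwrite"},
]
print(duplicated_ids(bulk))  # {'a'}
```

If the bulks that fail with a version conflict never contain duplicated IDs, the third explanation becomes much less likely.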
####Possible solution:
I believe Elasticsearch tries to handle the operations in a bulk in parallel. That's why operations which relate to the same document might raise conflicts. Therefore, I believe that `retry_on_conflict` is not going to help here, since Elasticsearch will just constantly fail. One solution is to avoid putting related operations in the same bulk, but it's not always that simple. I'm thinking about catching the version conflict exception and then handling it by sending the failed operations synchronously, one by one (not in bulk). If this exception is rare and only a few operations fail, I believe that's decent behavior.
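
The fallback I have in mind could look roughly like the Python sketch below. It relies on the bulk response carrying one entry in `items` per operation, in request order, with status 409 for a version conflict. `send_bulk` and `send_single` are hypothetical callables wrapping whatever client is in use, and each element of `operations` stands for one logical operation rather than the raw bulk lines:

```python
def conflicted_operations(operations, bulk_response):
    """Pick the operations whose bulk item came back with HTTP status 409."""
    failed = []
    for op, item in zip(operations, bulk_response["items"]):
        # Each item is keyed by its action type, e.g. {"update": {...}}.
        result = next(iter(item.values()))
        if result.get("status") == 409:
            failed.append(op)
    return failed


def bulk_with_fallback(operations, send_bulk, send_single):
    """Send one bulk, then replay version conflicts one at a time, in order."""
    response = send_bulk(operations)          # hypothetical client wrapper
    retried = conflicted_operations(operations, response)
    for op in retried:
        send_single(op)                       # hypothetical, synchronous
    return retried
```

Replaying the failures sequentially preserves their relative order within the bulk, which matters when two of them touch the same root document.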
####Remark:
By the way, I do know that I cannot guarantee the order of the operations (between servers or between threads), which is dangerous when overwriting data instead of just adding data. I'm going to solve this in the upsert script itself (the overwrite will be made only if the `last_update` field of the currently indexed document is older than the `last_update` field of the updating document).
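
To make the intended guard concrete, here is the logic sketched in Python rather than in the actual upsert script language. It assumes `last_update` holds a directly comparable value such as epoch milliseconds, and that a document without the field may always be overwritten. Both are assumptions for illustration:

```python
def should_overwrite(current_doc, incoming_doc):
    """Overwrite only when the incoming document is strictly newer."""
    current_ts = current_doc.get("last_update")
    if current_ts is None:
        # Assumption: nothing recorded yet, so the incoming data wins.
        return True
    return incoming_doc["last_update"] > current_ts


def apply_update(current_doc, incoming_doc):
    """Return the document that should end up in the index."""
    if should_overwrite(current_doc, incoming_doc):
        merged = dict(current_doc)
        merged.update(incoming_doc)
        return merged
    return current_doc


indexed = {"name": "old", "last_update": 200}
stale = {"name": "stale", "last_update": 100}
fresh = {"name": "fresh", "last_update": 300}
print(apply_update(indexed, stale)["name"])  # old
print(apply_update(indexed, fresh)["name"])  # fresh
```

With this check, an update that arrives late is simply ignored, so the result no longer depends on the order in which the servers or threads happen to write.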
What do you think?