We're using the bulk API with upserts, without the `retry_on_conflict` parameter. The architecture is as follows: two servers, each handling different event types, continuously receive data and write it in bulks to the same index. The upsert script handles the situation where we're adding nested documents to root documents that don't exist in the index yet. In addition, when a root document does exist, the script might decide to overwrite (replace) some of its data.
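For reference, a bulk body of this kind can be sketched as below. This is a minimal sketch, assuming a recent Elasticsearch version: each item is an `update` action followed by a body that uses `scripted_upsert` with an empty `upsert`, so the script runs whether or not the root document exists yet. `build_bulk_body`, the index name and the script text are hypothetical placeholders, and `retry_on_conflict` is optional so it can be switched on for comparison. Older releases spelled it `_retry_on_conflict` and also required `_type`, so check the bulk API reference for your version.

```python
import json


def build_bulk_body(index, docs, script_source, retry_on_conflict=None):
    """Serialize scripted-upsert update actions as a _bulk NDJSON body.

    Hypothetical helper: `docs` is a list of (doc_id, params) pairs, and
    `script_source` stands in for the real Painless upsert script.
    """
    lines = []
    for doc_id, params in docs:
        action = {"_index": index, "_id": doc_id}
        if retry_on_conflict is not None:
            # Per-item option; older versions used "_retry_on_conflict".
            action["retry_on_conflict"] = retry_on_conflict
        lines.append(json.dumps({"update": action}))
        lines.append(json.dumps({
            "scripted_upsert": True,  # run the script even if the doc is missing
            "script": {"source": script_source, "lang": "painless", "params": params},
            "upsert": {},             # empty initial doc; the script fills it in
        }))
    # The bulk API expects a newline after the last line of the body.
    return "\n".join(lines) + "\n"
```

The resulting string is what would be POSTed to `_bulk` with the `application/x-ndjson` content type.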
Recently, the volume of incoming data has grown, and version conflicts are now raised about 5-6 times a day. The exception that appears in the log looks something like this:
"Version conflict, current , provided ".
#### Possible explanations for the problem, as far as I can see:
1. The two servers might try to write to the same documents simultaneously.
2. Each server runs in a parallel, multi-threaded environment, so several bulks are sent to Elasticsearch at the same time, and the conflicts are between bulks sent from the same server.
3. A single bulk contains several operations related to the same document. I believe this is the root cause of the problem. Why? Right now, the code handles this exception by re-sending the failed operations of the bulk in another bulk, but that doesn't help. If the root cause were 1 or 2, this retry would have solved the problem.
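One way to test explanation 3 is to make it impossible: split each batch so that no bulk contains two operations on the same document ID. The sketch below is a hypothetical helper, assuming operations are `(doc_id, payload)` pairs; the k-th operation for a given ID goes into batch k, so operations on the same document keep their relative order.

```python
from collections import defaultdict


def split_by_document(ops):
    """Partition (doc_id, payload) ops into batches with unique doc ids.

    The k-th op for a given doc_id lands in batch k, so per-document order
    is preserved as long as batches are sent one after another.
    """
    batches = []
    placed = defaultdict(int)  # doc_id -> number of its ops already placed
    for doc_id, payload in ops:
        k = placed[doc_id]
        if k == len(batches):
            batches.append([])
        batches[k].append((doc_id, payload))
        placed[doc_id] = k + 1
    return batches
```

Usage: send `batches[0]`, wait for its response, then send `batches[1]`, and so on. If conflicts still appear with this split in place, explanation 3 is not the (only) cause.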
I believe Elasticsearch tries to process the operations in a bulk in parallel. That's why operations related to the same document might raise conflicts. Therefore, I believe `retry_on_conflict` is not going to help here, since Elasticsearch will just keep failing. One solution is to avoid putting related operations in the same bulk, but that's not always simple. I'm thinking about catching the version conflict exception and then handling it by sending the failed operations synchronously, one by one (not in a bulk). If this exception is rare and only a few operations fail, I believe that's a decent behavior.
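The one-by-one fallback could look roughly like this. It is a sketch under stated assumptions: it relies on the bulk response carrying a top-level `errors` flag and one entry in `items` per action, in request order, each keyed by its action type with a `status` field (409 for a version conflict). `send_one` is a hypothetical caller-supplied function that performs a single update request and returns `True` on success; it is not part of any client library.

```python
def conflicted_ops(ops, response):
    """Return the ops whose bulk item failed with HTTP 409 (version conflict)."""
    if not response.get("errors"):
        return []
    failed = []
    for op, item in zip(ops, response["items"]):
        # Each item looks like {"update": {"_id": ..., "status": ..., ...}}.
        result = next(iter(item.values()))
        if result.get("status") == 409:
            failed.append(op)
    return failed


def retry_one_by_one(failed, send_one, max_attempts=3):
    """Resend failed ops synchronously, one request per op.

    `send_one(op)` is hypothetical: it should issue a single update request
    and return True on success, False on another conflict.
    Returns the ops that still failed after `max_attempts` tries.
    """
    still_failing = []
    for op in failed:
        for _ in range(max_attempts):
            if send_one(op):
                break
        else:
            still_failing.append(op)
    return still_failing
```

Capping the attempts keeps a persistently conflicting document from blocking the thread; whatever is returned from `retry_one_by_one` can be logged for inspection.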
By the way, I know I cannot guarantee the order of the operations (between servers or between threads), which is dangerous when overwriting data instead of just adding it. I'm going to solve this in the upsert script itself: the overwrite will happen only if the `last_update` field of the currently indexed document is older than the `last_update` field of the incoming document.
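The intended overwrite rule can be modeled in plain Python before it is written in Painless. This is only a model of the decision, not the script itself, and it assumes `last_update` holds a directly comparable value such as epoch milliseconds. `apply_update` is a hypothetical name; returning `None` stands for "leave the document unchanged", which in a Painless update script would typically be done by setting `ctx.op = 'none'`.

```python
def apply_update(current, incoming):
    """Model of the proposed upsert-script rule (not Painless itself).

    current:  the indexed _source, or None if the root doc doesn't exist yet
    incoming: the updating document, carrying a comparable `last_update`
    Returns the new _source, or None to leave the document unchanged.
    """
    if current is None:
        return dict(incoming)  # upsert path: create the root document
    existing = current.get("last_update")
    if existing is not None and existing >= incoming["last_update"]:
        return None  # stale or same-age update: skip (ctx.op = 'none')
    merged = dict(current)
    merged.update(incoming)  # newer data overwrites the replaced fields
    return merged
```

With this rule, an out-of-order update that arrives late is simply skipped, so the arrival order between servers and threads no longer determines the final content.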
What do you think?