Version Conflicts in Bulk Upsert

Hello everyone,

We're using the bulk API with upsert, without the version or retry_on_conflict parameters. The architecture is as follows: 2 servers, each handling different event types, constantly receive data and write it in bulks to the same index. The upsert script handles the situation where we're adding nested documents to root documents which don't yet exist in the index. In addition, when a root document does exist, the script might decide to overwrite (replace) some of its data.

####The problem:
Lately, we have been getting more data. This has caused conflicts to be raised about 5-6 times a day. The exception that appears in the log looks something like this:

"Version conflict, current [4], provided [3]".

####Possible explanations for the problem, as far as I can see:

  1. The servers might simultaneously try to write to the same documents.

  2. Each server works in a parallel, multi-threaded environment, meaning that several bulks are sent to Elasticsearch at once and the conflict is between bulks sent from the same server.

  3. The bulk itself contains operations which are related to the same documents. I believe that's the root cause of the problem. Why? Because right now the code tries to handle this exception by re-sending the failed operations of the bulk in another bulk, but that doesn't help. If the root cause were 1 or 2 above, this behavior would have solved the problem.
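
One way to test explanation 3 is to make sure no single bulk ever contains two operations for the same document. The sketch below is only an illustration under assumptions: each operation is modelled as a hypothetical `(doc_id, payload)` tuple, and `split_into_waves` is a made-up helper name, not part of any Elasticsearch client. It splits a batch into successive "waves" in which every document id appears at most once, keeping the original order of operations per document.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical shape of one operation: (document id, update payload).
Operation = Tuple[Hashable, Dict]


def split_into_waves(operations: List[Operation]) -> List[List[Operation]]:
    """Split operations into batches where each document id appears
    at most once per batch, preserving per-document order."""
    waves: List[List[Operation]] = []
    # For each document id, the index of the wave its next operation must go to.
    next_wave: Dict[Hashable, int] = {}
    for doc_id, payload in operations:
        wave_index = next_wave.get(doc_id, 0)
        if wave_index == len(waves):
            waves.append([])  # open a new wave when needed
        waves[wave_index].append((doc_id, payload))
        next_wave[doc_id] = wave_index + 1
    return waves


if __name__ == "__main__":
    ops = [("a", {"n": 1}), ("b", {"n": 2}), ("a", {"n": 3})]
    for wave in split_into_waves(ops):
        print(wave)
```

Each wave would then be sent as its own bulk, waiting for the previous one to finish before sending the next. If the conflicts disappear with this in place, that points to explanation 3; if they persist, 1 or 2 is more likely.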

####Possible solution:

I believe Elasticsearch tries to process the operations in a bulk in parallel. That's why operations related to the same document might raise conflicts. Therefore, I believe that retry_on_conflict is not going to help here, since Elasticsearch will just keep failing. One solution is to avoid putting related operations in the same bulk - but it's not always that simple. I'm thinking about catching the version conflict exception and then handling it by sending the failed operations synchronously, one by one (not in bulk). If this exception is rare and only a few operations fail, I believe that's decent behavior.
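
A rough sketch of that fallback, under stated assumptions: the bulk response is taken as the parsed JSON body, whose `items` list has one single-key entry per request operation, in request order, each carrying a `status` (409 for a version conflict). `send_one` is a hypothetical callable you would supply yourself; it sends one operation on its own and returns the HTTP status. Neither function name comes from a client library.

```python
from typing import Any, Callable, Dict, List, Sequence


def conflicted_positions(bulk_response: Dict[str, Any]) -> List[int]:
    """Return the positions of bulk items that failed with HTTP 409."""
    positions = []
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item is a single-key dict keyed by the action type
        # ("index", "update", ...); take its only value.
        result = next(iter(item.values()))
        if result.get("status") == 409:
            positions.append(position)
    return positions


def retry_conflicts_one_by_one(
    operations: Sequence[Any],
    bulk_response: Dict[str, Any],
    send_one: Callable[[Any], int],  # hypothetical: returns an HTTP status
    max_attempts: int = 3,
) -> List[Any]:
    """Re-send conflicted operations individually, in bulk order.
    Returns the operations that still conflict after max_attempts."""
    still_failing = []
    for position in conflicted_positions(bulk_response):
        operation = operations[position]
        for _ in range(max_attempts):
            if send_one(operation) != 409:
                break
        else:
            # The loop never broke: every attempt ended in a conflict.
            still_failing.append(operation)
    return still_failing
```

Re-sending in the original bulk order keeps operations on the same document in sequence, and anything returned in `still_failing` can be logged or queued rather than silently dropped.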

By the way, I do know that I cannot guarantee the order of the operations (between servers or between threads) - which is dangerous when overwriting data instead of just adding data. I'm going to solve this in the upsert script itself (the overwrite will be made only if the last_update field of the currently indexed document is older than the last_update field of the updating document).
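
To illustrate the last_update guard, here is the rule modelled as a plain Python function. This is not the upsert script itself, which would be written in the cluster's scripting language; `apply_overwrite` and the `status` field are hypothetical, and last_update is assumed to be a comparable value such as epoch milliseconds.

```python
from typing import Any, Dict, Optional


def apply_overwrite(
    current: Optional[Dict[str, Any]], incoming: Dict[str, Any]
) -> Dict[str, Any]:
    """Overwrite fields only when the incoming document is strictly newer."""
    if current is None:
        # No root document yet: this is the insert half of the upsert.
        return dict(incoming)
    if incoming["last_update"] > current["last_update"]:
        merged = dict(current)
        merged.update(incoming)
        return merged
    # Incoming data is stale (or equally old): keep what is indexed.
    return current


if __name__ == "__main__":
    older = {"last_update": 100, "status": "open"}
    newer = {"last_update": 200, "status": "closed"}
    # Either arrival order should end in the same state.
    print(apply_overwrite(apply_overwrite(None, older), newer))
    print(apply_overwrite(apply_overwrite(None, newer), older))
```

With this rule the final state no longer depends on which server or thread gets there first, as long as the last_update values themselves come from a consistent clock.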

What do you think?

You should definitely move away from nested documents. Every usage of nested documents I have seen over the years in production systems has led to issues, ranging from small to severe. You listed the possible root causes correctly - and the right solution here IMO is to redesign your documents so they do not use nested documents. It is possible in most cases even if it's complex and requires a mind shift. My 2 cents.

Itamar Syn-Hershko
Freelance Developer & Consultant
Microsoft MVP | Lucene.NET PMC | @synhershko

Wow, that's a very sad thing to hear. We have managed to stay away from Parent-Child (especially when we read that "Parent-child queries can be 5 to 10 times slower than the equivalent nested query!" in The Definitive Guide and that there is no support for parents aggregation), but our whole project is based on 2 multi-time-based indices with nested documents. Is parent-child better?

I would like to hear more about the issues you had with nested documents here or in a private message. Right now my problem is far from being severe - but I would like to know what's more to come :fearful:.
