Potential Race Condition during Document Updates in ES 6.7

We recently upgraded our standby cluster from ES 5.6 to ES 6.7.1. Since the upgrade, we're seeing a situation where some updates to our documents are not being applied correctly.

Some additional details on our use case:

  1. We have a system that asynchronously updates ES documents in our cluster.
  2. We have two different modules that update documents. One is adding fields, the other is removing fields.
  3. When adding/updating fields, we use the Update API with doc_as_upsert=true and retry_on_conflict=3.
  4. When removing fields, we use scripted updates via the Update API, also with retry_on_conflict=3.
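To make the two write paths above concrete, here is a minimal sketch of the request bodies our two modules send through elasticsearch-py's `es.update()`. The index, type, and field names (`myindex`, `mydoc`, `status`, `stale_field`) are hypothetical placeholders, not our real schema:

```python
# Module 1: add/update fields, upserting if the document is missing.
upsert_body = {
    "doc": {"status": "active"},  # fields being added or changed
    "doc_as_upsert": True,        # create the document if it doesn't exist
}

# Module 2: scripted update that removes a field from the document.
delete_body = {
    "script": {
        "lang": "painless",
        "source": "ctx._source.remove('stale_field')",
    }
}

# Both bodies are sent with retry_on_conflict=3, e.g.:
#   es.update(index="myindex", doc_type="mydoc", id=doc_id,
#             body=upsert_body, retry_on_conflict=3)
```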

We have a portion of our system that kicks off both an update and a delete to some documents. We're seeing cases where only one of the two operations succeeds: sometimes only the delete operation is applied, and sometimes only the update operation is applied.

It's important to note that we're using the Elasticsearch Python library in the code that updates ES.

We didn't run into this issue in 5.6 as the retry_on_conflict setting seemed to ensure our operations completed. We're not seeing the same behavior in ES 6.7.1.

We understand that ES changed how document versions are tracked and added optimistic concurrency control (sequence numbers and primary terms). We're concerned that document versioning and update-conflict checking no longer work the way we expect from 5.6.
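For reference, here's the concurrency metadata we're inspecting while debugging. In 6.x, a GET response carries `_seq_no` and `_primary_term` alongside the older `_version` field; the response below is a fabricated example shaped like the GET API output (all values made up):

```python
# Illustrative 6.x GET response (values are made up). _seq_no advances on
# every write to the shard and _primary_term on every primary failover,
# while _version is the per-document counter that 5.6-era conflict
# checking relied on.
sample_get_response = {
    "_index": "myindex",
    "_type": "mydoc",
    "_id": "doc-1",
    "_version": 4,
    "_seq_no": 17,
    "_primary_term": 1,
    "found": True,
    "_source": {"status": "active"},
}

# A conflict check compares the seq_no/primary_term observed at read time
# against what the shard holds at write time; retry_on_conflict re-runs
# the read-modify-write when that check fails.
print(sample_get_response["_seq_no"], sample_get_response["_primary_term"])
```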

We're currently doing further debugging and research. We're trying to enable TRACE logging to dig deeper into what's going on.

Is anyone else running into this? Any help you can provide is appreciated.

We'll post updates as we find more details.

Thanks!

Some additional details (I work with Ryan):

This seems to be an issue in v6.7.1 when two updates happen simultaneously on the same document and one of them is a scripted field delete. I've checked a handful of occurrences, and the issue seems to happen when the scripted field delete appears first in the logs and the update appears right after. We're using retry_on_conflict because these operations have no specific order, and they operate on different fields.

When things are working normally, here's what the v6.7.1 logs look like:

  • The first operation appears (update fields), and the full document has the updated fields applied.
  • The second operation appears (scripted delete), and the full document has the deleted fields removed and the updated fields applied.

When the order of operations is reversed, here's what the logs look like:

  • The first operation appears (scripted delete), and the full document has the deleted fields removed.
  • The second operation appears (update fields), and the full document has the updated fields applied, but the deleted fields are still in the document.

In this bad scenario, we've seen handfuls of documents end up in one of two cases:

  • The update operation worked, but the field delete operation didn't. The "deleted" fields are still in the final document.
  • Vice versa: the deleted fields are gone, but the field updates didn't happen.

We've confirmed that the retry_on_conflict parameter works on two normal document updates in v6.7.1 - I've seen conflict exceptions in the logs, and they get handled appropriately.
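For anyone debugging similar behavior, a client-side retry wrapper around conflict exceptions can make those retries visible and loggable, in addition to the server-side retry_on_conflict. A minimal sketch follows; the `ConflictError` class here is a local stand-in for `elasticsearch.exceptions.ConflictError` (HTTP 409) so the example is self-contained, and `flaky_update` is a stub, not a real ES call:

```python
import time

class ConflictError(Exception):
    """Stand-in for elasticsearch.exceptions.ConflictError (HTTP 409)."""

def update_with_retry(do_update, retries=3, backoff=0.01):
    """Run do_update(), retrying on version conflicts.

    do_update is a zero-argument callable that would issue the
    es.update() call. This mirrors what retry_on_conflict does
    server-side, but gives the client control over logging and backoff.
    """
    for attempt in range(retries + 1):
        try:
            return do_update()
        except ConflictError:
            if attempt == retries:
                raise  # out of retries; surface the conflict
            time.sleep(backoff * (attempt + 1))

# Stub that conflicts twice before succeeding, to exercise the loop.
attempts = {"n": 0}
def flaky_update():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConflictError()
    return {"result": "updated"}

print(update_with_retry(flaky_update))  # {'result': 'updated'}
```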

None of this seems to happen in v5.6.14. The exact same updates are happening on that cluster, but all the documents end up in a good state.
