How are version conflicts during _delete_by_query handled?

I have a question regarding _delete_by_query and (internal) versioning:

We're working with generational datasets, where a new generation obsoletes the previous one. A generation consists of around a million documents, each with a unique key. However, the keys vary over the generations:

  1. many keys from the previous generation will still be in the new generation; we want these documents to be updated in place
  2. some keys from the previous generation will no longer be in the new generation; we want these documents to be removed
  3. the new generation will also contain some keys that were not in the previous generation; we want these documents to be inserted

We're currently doing this as follows:

  • We're including the generation number in each document.
  • When a new generation becomes available, we bulk-insert all documents of the new generation, each stamped with the new generation number. This satisfies cases 1 and 3.
  • Immediately after the bulk inserts finish, we use _delete_by_query to delete all documents that do not carry the new generation number. This should satisfy case 2 (but might delete too much if not all updates have finished?).
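For concreteness, the two steps above can be sketched like this (a minimal illustration of the request bodies only; the index name `dataset`, type name `doc`, and field name `generation` are placeholders, not our real mapping):

```python
import json

# Placeholder names for illustration; the real index/field names differ.
INDEX = "dataset"
NEW_GEN = 42  # the new generation number

def bulk_body(docs):
    """Build an ND-JSON _bulk body that indexes every document of the
    new generation under its unique key, stamping the generation number.
    Indexing by _id overwrites existing keys (case 1) and creates
    missing ones (case 3)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": INDEX, "_type": "doc",
                                           "_id": doc["key"]}}))
        lines.append(json.dumps(dict(doc, generation=NEW_GEN)))
    return "\n".join(lines) + "\n"

# _delete_by_query body that removes every document still carrying an
# old generation number (case 2); POSTed to /dataset/_delete_by_query.
delete_old_generations = {
    "query": {
        "bool": {
            "must_not": {"term": {"generation": NEW_GEN}}
        }
    }
}
```

To continue past version conflicts rather than abort, the delete request would be sent as `POST /dataset/_delete_by_query?conflicts=proceed` (or with `"conflicts": "proceed"` in the body).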

During the _delete_by_query we get version conflicts (VersionConflictEngineException), probably because the index is still being updated. The question is: what happens if we proceed on conflicts?

  • Can it be that there is a version conflict on a 'type 2' document, which is therefore not deleted and remains present (until the next generation performs its clean-up)?
  • Can it be that there is a version conflict on a 'type 1' document that was not yet updated? In that case it might get deleted and the update lost.
  • Or can we only get version conflicts on 'type 1' documents that were already updated, in which case the failing delete is not a problem (and it is safe to proceed on conflicts)?

Any insight is greatly appreciated.

PS: We're using the HTTP API on ES 5.5.
PS2: Using separate indices for different generations is not an option for us.

Could you go into a bit more detail on why using separate indices is not an option? I would have approached this with index aliases and a separate index per generation.

Alias switches are near-instantaneous and avoid inconsistencies during the update: you can finish ingesting the next generation into its own index, switch the alias to the new index (which makes all of it immediately available for searches), and delete the previous generation(s) at your leisure.
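For reference, that switch is a single atomic `_aliases` call; a sketch (the index and alias names here are made up):

```python
import json

# Hypothetical names: alias "dataset" currently points at "dataset-gen41";
# "dataset-gen42" has just finished ingesting the new generation.
OLD_INDEX = "dataset-gen41"
NEW_INDEX = "dataset-gen42"
ALIAS = "dataset"

# POST /_aliases — both actions are applied atomically, so searches
# against the alias never see a gap or a mix of the two generations.
swap_alias = {
    "actions": [
        {"remove": {"index": OLD_INDEX, "alias": ALIAS}},
        {"add": {"index": NEW_INDEX, "alias": ALIAS}},
    ]
}

print(json.dumps(swap_alias))
```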

The most important reason separate indices are not an option is that we're performing searches during the updates and do not want any duplicates, nor any gaps. In addition, we're already using aliases to combine datasets that need to be searched together. Given that, we think we have to update 'in place'.

Another reason would be that we'd end up with thousands of indices, as we have several generations per day for different datasets with a retention period of 1 year.

Regardless of whether separate indices is possible or not, we'd like to understand what happens on the version conflicts. Could you shed some light on that?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.