I have a question regarding `_delete_by_query` and (internal) versioning:
We're working with generational datasets, where a new generation obsoletes the previous one. A generation consists of around a million documents, each with a unique key. However, the keys vary over the generations:
1. many keys from the previous generation will still be in the new generation; we want these documents to be updated in place
2. some keys from the previous generation will no longer be in the new generation; we want these documents to be removed
3. the new generation will also contain some keys that were not in the previous generation; we want these documents to be inserted
We're currently doing this as follows:
- We're including the generation number in each document.
- When a new generation becomes available, we perform bulk inserts with all new documents, each containing the new generation number. This will satisfy 1. and 3.
- Immediately after finishing the bulk inserts we delete all documents that do not have the new generation number using `_delete_by_query`. This should satisfy 2. (but might delete too much if not all updates have finished?).
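To make the delete step concrete, here is a minimal sketch of the request we build. The host, the index name `datasets`, the field name `generation`, and the helper `build_delete_by_query` are placeholders for illustration, not our real names. It only constructs the URL and body; `conflicts=proceed` is the setting the question below is about, and `refresh=true` is an optional extra.

```python
import json
from urllib.parse import urlencode


def build_delete_by_query(host, index, new_generation, field="generation"):
    """Build (url, body) for a request deleting all docs NOT in new_generation.

    All names here are hypothetical placeholders.
    """
    # conflicts=proceed: count version conflicts instead of aborting the request.
    # refresh=true: refresh the affected shards once the request has finished.
    params = urlencode({"conflicts": "proceed", "refresh": "true"})
    url = f"{host}/{index}/_delete_by_query?{params}"
    body = {
        "query": {
            "bool": {
                # Match every document whose generation differs from the new one.
                "must_not": [{"term": {field: new_generation}}]
            }
        }
    }
    return url, json.dumps(body)


url, body = build_delete_by_query("http://localhost:9200", "datasets", 42)
print("POST", url)
print(body)
```

The body would be sent as a POST to that URL with any HTTP client; the response reports how many documents were deleted and how many version conflicts were skipped.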
During the `_delete_by_query` we get version conflicts (`VersionConflictEngineException`), probably because the index is still being updated. The question is what happens if we proceed on conflicts?
- Can it be that there is a version conflict on a 'type 2.' document that is therefore not deleted and will remain present (until the next generation performs its clean-up)?
- Can it be that there is a version conflict on a 'type 1.' document that was not yet updated, in which case it might get deleted and the update would be lost?
- Or can we only get version conflicts on 'type 1.' documents that were already updated, in which case the failing delete is not a problem (and it is safe to proceed on conflicts)?
Any insight is greatly appreciated.
PS: We're using the HTTP API on ES 5.5.
PS2: Using separate indices for different generations is not an option for us.