Idempotent indexing into ElasticSearch when neither data nor version number has changed?


(Michael Snell) #1

Hi - Currently, when using external versioning, ElasticSearch always
returns the same error when there is a version number conflict, e.g.:

VersionConflictEngineException[[twitter][2] [tweet][1]: version conflict,
current [3], required [2]]

Is it possible to change this so that a different error is returned when
neither the version number nor the data has changed? In other words, it
should be possible to differentiate between attempting to reindex the same
data with the same version number (which is inefficient but harmless) and
attempting to index different data with the same version number (which
indicates a bug in version numbering in the source system).


(Shay Banon) #2

If you are using the Java API, you can get the current version and the
provided version from the VersionConflictEngineException. If you are using
the REST API, your best bet is to parse the error message. I have been
meaning to allow "failures" to be serialized into JSON (for example), to
provide more metadata on the failure itself for the REST API.



(Michael Snell) #3

Hi - I don't think I explained the problem clearly: The issue is, given
that we can already tell the version number hasn't changed, can we tell
whether the data itself has changed or not?

In our case, we use the Java API and already parse the current and
provided version numbers out of the error string (using a regular
expression), so we know if we've attempted to index the same version.
However, we need to differentiate between:

  1. Same version and same data: Inefficient but harmless, so we'd like to
    ignore this (it could be caused, for example, by indexing a list of items
    that includes some which have not actually changed)

  2. Same version but different data: This indicates a data inconsistency, so
    we'd like to raise an exception if this occurs (it could be caused, for
    example, by attempting to index data which has been updated in the database
    via a manual SQL update, but without the version column being incremented)
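
As a rough illustration of the regular-expression step mentioned above, here is a minimal sketch in Python (the thread itself uses the Java API, so this is only an analogy). It assumes the error message has exactly the wording quoted in this thread; other ElasticSearch versions may format it differently. The names `Conflict` and `parse_conflict` are hypothetical helpers, not part of any ElasticSearch client.

```python
import re
from typing import NamedTuple, Optional

# Matches the wording quoted in this thread:
#   "... version conflict, current [3], required [2]]"
# Assumption: "required" is the version supplied with the index request.
_CONFLICT_RE = re.compile(r"current\s+\[(\d+)\],\s+required\s+\[(\d+)\]")


class Conflict(NamedTuple):
    current: int   # version already stored in the index
    provided: int  # version sent with the failed index request


def parse_conflict(message: str) -> Optional[Conflict]:
    """Extract both version numbers, or return None if the text does not match."""
    match = _CONFLICT_RE.search(message)
    if match is None:
        return None
    return Conflict(current=int(match.group(1)), provided=int(match.group(2)))


msg = ("VersionConflictEngineException[[twitter][2] [tweet][1]: "
       "version conflict, current [3], required [2]]")
conflict = parse_conflict(msg)
print(conflict)  # Conflict(current=3, provided=2)
```

Knowing that `current == provided` only establishes that the same version was sent again; it says nothing about whether the data is the same, which is the open question here.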



(Shay Banon) #4

Ahh, I see. Since the index response does not return the indexed data on a
version conflict failure, no, you can't. You could do a get beforehand,
compare the data yourself, and use the version returned from the get to
update the data...
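
One possible reading of the get-then-compare suggestion, sketched in Python as a pure decision function. It assumes external versioning, where an index request is accepted only if its version is strictly greater than the stored one, and it assumes the caller has already fetched the stored version and source. `Action` and `plan_index` are hypothetical names, and the check is not atomic: another writer could change the document between the get and the index request.

```python
from enum import Enum
from typing import Any, Mapping, Optional


class Action(Enum):
    INDEX = "index"                # safe to send the index request
    SKIP = "skip"                  # same version, same data: nothing to do
    INCONSISTENT = "inconsistent"  # same version, different data: report it
    STALE = "stale"                # incoming version is older than the stored one


def plan_index(stored_version: Optional[int],
               stored_source: Optional[Mapping[str, Any]],
               new_version: int,
               new_source: Mapping[str, Any]) -> Action:
    """Decide what to do with a document before sending it to the index."""
    if stored_version is None:
        return Action.INDEX  # document does not exist yet
    if new_version > stored_version:
        return Action.INDEX
    if new_version < stored_version:
        return Action.STALE
    # Versions are equal, so only the data can tell the two cases apart.
    if stored_source == new_source:
        return Action.SKIP
    return Action.INCONSISTENT


print(plan_index(2, {"user": "kimchy"}, 2, {"user": "kimchy"}))
```

The trade-off is an extra get for every document, including the ones that would have indexed cleanly.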



(Michael Snell) #5

Hi - We've now implemented this ourselves, i.e. if a
VersionConflictEngineException is returned, we use a regexp to extract the
current and provided version numbers. If they are equal, we get the current
source from ElasticSearch and compare it with the source we are trying to
index, and only raise an error if they differ.

This seems to work well enough, but I'm sure it would be more efficient if
ElasticSearch did this internally - perhaps a feature request could be
raised for a future version?
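
The workaround described in this post might look roughly like the following Python sketch. It is not the poster's actual code (which uses the Java API). Assumptions: the error message has the wording quoted earlier in the thread, sources are JSON strings, and `fetch_source` is a hypothetical callback standing in for a get request against the index. `handle_conflict` and `InconsistentDocumentError` are likewise illustrative names.

```python
import json
import re
from typing import Callable

# Assumes the message wording quoted in this thread.
_CONFLICT_RE = re.compile(r"current\s+\[(\d+)\],\s+required\s+\[(\d+)\]")


class InconsistentDocumentError(Exception):
    """Same version number was sent with different data."""


def _canonical(source_json: str) -> str:
    # Re-serialise with sorted keys so key order does not affect the comparison.
    return json.dumps(json.loads(source_json), sort_keys=True)


def handle_conflict(message: str,
                    new_source_json: str,
                    fetch_source: Callable[[], str]) -> bool:
    """Return True if the conflict is a harmless re-index of identical data.

    Returns False for an ordinary conflict (versions differ), and raises
    InconsistentDocumentError when the versions match but the data differs.
    """
    match = _CONFLICT_RE.search(message)
    if match is None:
        raise ValueError("unrecognised version conflict message")
    current, provided = int(match.group(1)), int(match.group(2))
    if current != provided:
        return False  # genuine conflict; the stored source is never fetched
    if _canonical(fetch_source()) == _canonical(new_source_json):
        return True
    raise InconsistentDocumentError(
        f"version {provided} was re-sent with different data")


same_version = ("VersionConflictEngineException[[twitter][2] [tweet][1]: "
                "version conflict, current [2], required [2]]")
harmless = handle_conflict(same_version,
                           '{"user": "kimchy", "n": 1}',
                           lambda: '{"n": 1, "user": "kimchy"}')
```

Comparing parsed JSON rather than raw strings avoids false alarms from key order or whitespace; whether that is the right notion of "same data" depends on the application.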


