External Versioning Enhancement - need your input!

I would like to propose an enhancement to ES external versioning. Currently
the external version is a Long, and a larger number implies a newer version.
It works great if your source is a single entity (say, a simple persistent
object managed with JPA, where we pass the JPA-managed entity version it
uses for optimistic locking on to ES).

It works less well when you pass a graph of persistent objects to ES, which
happens all the time, since ES is all about de-normalizing the data. The
problem is that each entity in the graph has its own version, and in most
cases a modification of a child entity's content should only increment that
entity's version, not the version of the owning (or related) objects. So for
this real-world scenario the simple and elegant idea of reusing your source
entities' versioning infrastructure does not work.

So here is the idea:

Support a composite version indicator, which is simply an array of the
versions of the source entities that make up the denormalized ES document.
ES would not need to understand it (it is the data provider's responsibility
to supply the right array of versions); it would just compare the two arrays
from left to right to decide whether the incoming version is newer.

There are two main use cases:

  1. When the parent object of the denormalized graph has a 1-1 relationship
    with all its parts (say, a Person has references to a HomeAddress and a
    WorkAddress), the version array has a fixed length: [personVersion,
    homeAddressVersion, workAddressVersion]

  2. When there are any 1-N relationships (e.g. a PurchaseOrder and its
    POLines get denormalized), we use a nested array for the PO lines:
    [purchaseOrderVersion, POLineVersions[]]. The trick here is that the
    parent version precedes the child versions array, so if the parent
    versions are not equal there is no need to continue. If they are equal,
    then the two collections of POLines must be identical (any difference due
    to collection add/remove operations should increment the PurchaseOrder
    version), and thus the version arrays must have a) the same number and
    order of elements, and b) the newer PurchaseOrder's lines will have
    larger (or equal) POLine versions in the array

Well, the description is rather lengthy, but the ES-side algorithm is
trivial: compare arrays of integers, or nested arrays of integers,
recursively. Very simple and fast!
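To make the idea concrete, here is a minimal sketch of the recursive comparison, under my own assumptions about the shape of the data (versions arrive as nested Object arrays of Longs; `isNewer` is a hypothetical helper, not anything in the ES codebase):

```java
// Hypothetical sketch: compare two composite version indicators.
// Elements are either Long versions or nested Object[] arrays for
// the versions of 1-N children.
public class CompositeVersion {

    // Returns true if candidate is strictly newer than current.
    public static boolean isNewer(Object[] candidate, Object[] current) {
        // Shape changes without a parent-version bump are rejected:
        // any add/remove of children must increment the parent version,
        // which would have been caught before reaching the nested array.
        if (candidate.length != current.length) {
            return false;
        }
        for (int i = 0; i < candidate.length; i++) {
            Object c = candidate[i];
            Object o = current[i];
            if (c instanceof Long && o instanceof Long) {
                long cv = (Long) c;
                long ov = (Long) o;
                if (cv > ov) return true;   // first larger element wins
                if (cv < ov) return false;  // stale update, discard
                // equal: keep scanning to the right
            } else if (c instanceof Object[] && o instanceof Object[]) {
                if (isNewer((Object[]) c, (Object[]) o)) return true;
                // not newer: continue only if the subtrees are fully equal
                if (!java.util.Arrays.deepEquals((Object[]) c, (Object[]) o)) {
                    return false;
                }
            } else {
                return false; // mismatched shapes
            }
        }
        return false; // identical versions: not newer
    }
}
```

Note that when a parent version differs, the child arrays are never inspected, which matches the "parent version precedes child versions" rule above.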

Such an approach would allow very robust external versioning for
denormalized object graphs, based on the versions of their parts, using the
array-of-versions concept. Simple number-based versioning would of course
stay as well.

An alternative to an array could be a binary string with a predefined
length per version, but I think an array is easier to deal with.

What do you think? For me it would dramatically simplify life and improve
performance, as I would no longer need to intercept changes to the versions
of all parts in order to increment a synthetic version for the entire graph
(not to mention the fact that when denormalization can be done in several
ways, one needs to maintain or calculate several synthetic versions for each
graph).

What do you think? Is it something worth proposing to the ES dev team?

Thank you,
Alex

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

From what I understand, the concept of "versioning" for ES documents is
used for optimistic concurrency control, not for a version history. The ES
document version is like a semaphore controlling the order in which
documents of the same identity arrive in the index, and it is atomic by
nature. Atomic longs are well supported by hardware, even at the CPU level.

From the view of CPU operations, an array of versions is a complex
datatype with a list of elements. This makes atomic operations expensive,
because the operation must traverse that list. If you want an array of
versions for managing a version list of entities, think about the
possibility of having a separate document for each of the entities.

Jörg

On 31.01.13 19:37, AlexR wrote:



One very important aspect of external versioning is preventing out-of-order updates. If versioning is external, you do not need optimistic locking; you just discard older updates.

It is not trivial to version a denormalized graph of objects using a single version indicator when each node is versioned independently, hence the proposal.

Again, with external versions it is the data source that does optimistic locking; ES just needs to compare the version arrays to figure out which update is newer.


Feedback, anyone?
Is avoiding out-of-sequence update issues important to anyone who pushes
from a database-backed app with multiple app servers?


Hey,

I personally think that moving away from a numerical version is not trivial
to begin with, and I don't think it gives us a real gain in functionality. I
have personally used a timestamp as the external version when I had problems
similar to yours, and that works very well. Do you think this could help you
too?

simon

On Friday, February 1, 2013 9:39:55 PM UTC+1, AlexR wrote:



Simon,

It's not about timestamp vs. numeric version; it's about how to do it for a
complex, concurrently updated graph of objects where any part can change,
while ensuring that the denormalized version of the graph sent to ES is not
stale. One approach is that every application pushing denormalized objects
tracks and rolls up changes to any part of the graph into its top entity,
using a numeric version or a timestamp. That is fairly labor intensive, and
actually not trivial in a highly concurrent environment if you use ORM
technologies (JPA, JDO) without introducing concurrency issues (using an
optimistic locking technique is not an option for a big graph, as it would
introduce huge contention problems). Another approach is to version every
object of the graph sent over to ES as an array of their versions. That is
very simple, because the versions are already present in the individual
entities and are managed as part of optimistic locking for those individual
objects.
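To illustrate the second approach, here is a small sketch of assembling the composite version array from the per-entity versions. The classes and field names are my own stand-ins: the long `version` fields play the role of JPA @Version attributes, and `compositeVersion` is a hypothetical helper, not part of any ES or JPA API:

```java
import java.util.List;

// Hypothetical sketch: build [purchaseOrderVersion, [lineVersion, ...]]
// from per-entity versions that the ORM already maintains.
public class VersionArrayBuilder {

    // Stand-ins for JPA entities; the version fields would normally
    // be @Version attributes managed by the persistence provider.
    static class POLine {
        final long version;
        POLine(long version) { this.version = version; }
    }

    static class PurchaseOrder {
        final long version;
        final List<POLine> lines;
        PurchaseOrder(long version, List<POLine> lines) {
            this.version = version;
            this.lines = lines;
        }
    }

    // Parent version first, then the nested array of line versions,
    // matching the ordering rule from the original proposal.
    static Object[] compositeVersion(PurchaseOrder po) {
        Object[] lineVersions = po.lines.stream()
                .map(l -> (Object) l.version)
                .toArray();
        return new Object[] { po.version, lineVersions };
    }
}
```

The point is that no extra bookkeeping is needed on the application side: the array is derived directly from state the ORM already tracks.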

So while support for an array of versions would require some work on the ES
side, it is logically very similar to using a single version value (which
would of course remain the default option).

I have to admit I have not thought my approach through in detail - I hoped
to get some discussion going and see what people do in cases like this.
I do see some issues with it, such as that the versions of untouched
entities in the graph pushed to ES will not necessarily be the latest,
without causing any optimistic locking exception. Say two users read the
same purchase order, one updated one line item and the other updated
another. Unless we control concurrency at the PO level, denying one user his
update, they will both succeed, and if they both push to ES at the same time
ES will end up with inconsistent data. Unfortunately, escalating concurrency
control to the very top of the graph is not an option in most cases, as it
would cause a very high level of contention, unwarranted by the business
logic, on that single lock (or version indicator).

I guess it may be that, fundamentally, a consistent push of an object
graph is not possible unless you enforce concurrency for the entire graph as
a whole, which I do not think is acceptable in a transactional system where
many users update parts of the graph.

So maybe I will have to resort to a hybrid pull/push approach, where my app
servers post only the IDs of modified objects (or rather their top-level
owners' IDs), and the indexer pulls all the IDs, collapses any redundancies,
and pulls the latest data into the index in an optimal way. The downside is
that pulling the graph from the DB on the indexer side roughly doubles the
database load by duplicating all the reads, and potentially adds more
latency than push.
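The collapse step of that hybrid approach could be as simple as de-duplicating the posted owner IDs before pulling; a minimal sketch, with a hypothetical `collapse` helper and made-up IDs:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of the indexer-side collapse step: app servers post
// the top-level owner IDs of modified graphs; the indexer de-duplicates
// them so each graph is pulled from the DB and re-indexed only once per
// cycle, regardless of how many parts of it changed.
public class ChangeCollapser {

    // LinkedHashSet keeps first-arrival order while dropping repeats.
    public static List<String> collapse(List<String> postedOwnerIds) {
        return new ArrayList<>(new LinkedHashSet<>(postedOwnerIds));
    }
}
```

Because the indexer always pulls the current state of each graph, out-of-order notifications no longer matter; only the extra read load and latency remain as costs.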

Any input or suggestions would be very welcome...

Alex

On Saturday, February 2, 2013 3:23:12 PM UTC-5, simonw wrote:

