Two Rivers modifying the same docs - overwriting, union, or...?

Hi,

What would happen if 2 Rivers (JDBC one and MongoDB one concretely) were to
add/modify the same documents in ES? Would they step on each other's toes
and overwrite each other's data, or would they complement each other?

2 cases:

A) Imagine a record for Bubba Gump Shrimp Co. with ID=123 is pulled from an
Oracle DB by the JDBC river and indexed with ID="123" owner="Bubba Gump".
Then a moment later the MongoDB river notices that its record for ID=123
has been modified. In MongoDB the owner field has the value "Papa Smurf",
and the river writes that to ES.

What is the value of the owner field? I assume "Papa Smurf".

B) Imagine the fields in the Oracle DB and MongoDB do not overlap, other
than the ID field. Oracle has the "owner" field and MongoDB doesn't; it has
a "market_cap" field instead, for example.
The rivers go through the same steps as above:
JDBC river writes data from Oracle: ID="123" company="Bubba..."
owner="Bubba Gump".
MongoDB river writes data from MongoDB: ID="123" market_cap="$12345".

What ends up in ES?
Is it a union of the 2 records, or does the last one win?

Thanks,
Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

--

IMHO:

  1. "Papa Smurf"
  2. ID="123" company="Bubba..." owner="Bubba Gump" market_cap="$12345"

Very nice idea to merge data from different sources. Smart.
I love the use case.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 11 Jan 2013 at 00:50, Otis Gospodnetic otis.gospodnetic@gmail.com wrote:

--

The JDBC river uses an internal node client, so I think document collisions
do not depend on the river feature. Collisions are avoided effectively by
the versioning feature of ES (the JDBC river is able to set a _version
attribute). But the question here seems to be about the case of an
"unversioned" document.

As far as I know, the version feature expresses a "document sending order"
that an indexing client can insist on. Documents are distributed across the
shards by a hash, and the shards can be at remote sites. Some nodes may be
busy (segment merging, for example) while others are not. There are many
sender times but only one receiver time. Without versioning, the document
that arrives last at the receiving primary shard wins. When the sender
times are very close or even the same, which document arrives last may
depend on the load of the nodes involved in the cluster.
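
To make the difference concrete, here is a toy in-memory model in Python.
It is only a sketch, not ES code: ToyIndex and VersionConflict are made-up
names, and it assumes each river sends a plain index request that replaces
the whole document, and that a sender-supplied version is accepted only
when it is strictly greater than the stored one.

```python
class VersionConflict(Exception):
    """Raised when an incoming version is not newer than the stored one."""


class ToyIndex:
    """Hypothetical stand-in for one primary shard: ID -> (version, doc)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc, version=None):
        stored = self.docs.get(doc_id)
        if version is None:
            # Unversioned write: the whole document is replaced and an
            # internal counter is bumped. Whoever arrives last wins.
            new_version = stored[0] + 1 if stored else 1
        else:
            # Sender-supplied version: reject anything that is not newer.
            if stored is not None and version <= stored[0]:
                raise VersionConflict(
                    f"version {version} is not newer than {stored[0]}"
                )
            new_version = version
        self.docs[doc_id] = (new_version, dict(doc))
        return new_version

    def get(self, doc_id):
        return self.docs[doc_id][1]


# Case B with two plain writes: the later arrival replaces the earlier one.
plain = ToyIndex()
plain.index("123", {"company": "Bubba...", "owner": "Bubba Gump"})
plain.index("123", {"market_cap": "$12345"})
print(plain.get("123"))  # {'market_cap': '$12345'}

# Case A with sender-assigned versions, arriving out of order.
versioned = ToyIndex()
versioned.index("123", {"owner": "Papa Smurf"}, version=2)
try:
    versioned.index("123", {"owner": "Bubba Gump"}, version=1)
except VersionConflict as exc:
    print("rejected:", exc)
print(versioned.get("123"))  # {'owner': 'Papa Smurf'}
```

Under these assumptions the unversioned case is "last one wins" for the
whole document, not a field-by-field union, and versioning only decides
which complete document survives.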

To merge documents in a distributed multi-river setup, something like a
"super-river" would have to control the order of the documents fetched from
the different sub-rivers, so that documents with the same ID could be merged
early and then passed to an ES node indexing client. Right now, rivers have
no way to communicate or synchronize with each other, except through
versioning.
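
The "merge early" step could look roughly like the Python sketch below.
This is only an illustration of the idea, not an existing river feature:
merge_by_id is a hypothetical helper, the records are plain dicts keyed by
an "ID" field, and it assumes sources passed later should override earlier
ones on overlapping fields.

```python
def merge_by_id(*sources):
    """Merge records from several sources into one document per ID.

    Sources are processed in argument order, so for a field present in
    more than one source the value from the later source is kept.
    """
    merged = {}
    for source in sources:
        for record in source:
            doc = merged.setdefault(record["ID"], {})
            doc.update(record)
    return merged


# Hypothetical batches fetched from the two sub-rivers (case B).
oracle_batch = [{"ID": "123", "company": "Bubba...", "owner": "Bubba Gump"}]
mongo_batch = [{"ID": "123", "market_cap": "$12345"}]

for doc_id, doc in merge_by_id(oracle_batch, mongo_batch).items():
    # Each merged document would then be sent to ES as a single write.
    print(doc_id, doc)
```

With the batches from case B this yields one document carrying company,
owner and market_cap; with the overlapping owner field from case A, the
MongoDB value would win because that source is passed last.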

Jörg

--