Jdbc river, versioning and manual updates

Umit_Seren · July 10, 2013, 7:57am

I have a couple of jdbc rivers that run every hour (using the simple
strategy). They work fine.

Now I have the requirement that new records should be "instantly" visible
in the search and also in some lists that get populated by ES.
In oder to do this I manually add the new records to ES as soon as they are
persisted in the DB.
However after the next river run those documents get deleted by the
housekeeping functionality of the jdbc river because they have a lower
version number than the the version number of the jdbc river.

How does the jdbc river deal with new document ? Is the version of newly
added rows automatically set to the one that is stored in the jdbc river
index ?
If so when manually adding documents to the ES index should I use
version_type = external and set the version to the one of the jdbc river ?

cheers
Ümit

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · July 10, 2013, 12:14pm

Thanks for your interest in JDBC river.

The JDBC river uses external versioning to solve a hard challenge, the
task of moving tabular data (SQL) to hierarchical documents (JSON). By
doing this by an SQL select statament with labeled column names (dot
notation) arbitrary JSON can be build out of rows. It is assumed a
single result row generates a single JSON doc.

In a river, there are three types of op types: create, index, delete.

So I have to deal with updates to existing JSON docs. Well, I overwrite
this docs (which is not optimal, it neglects partial updates).

Then, adding new docs. When new docs are added, there are two
possibilities of interpretation: a new whole index, or incremental adding.

Deleting data is most sensitive. Unfortunately, SQL deletes do not
propagate to the index from the RDBMS. To detect row deletions, I use
generations of SQL selects. Each river cycle is a generation. The
generation is the external version in ES. To ensure only one generation
exists, I have implemented a "housekeeper" that deletes all(!) docs that
have a lower version.

I'm aware this is not ideal for every situation. For example, it is
incompatible to incremental updates, which is expected intuitively by
many users.

So I prepared a very modular architecture (river source, river flow,
river mouth) so if you need another philosophy it should be quite easy
to add your customized strategy to the code by reusing much of the
existing code. The ES community is so smart and strong, I think there
must be more JDBC to ES strategies.

I have not thought about manipulating docs manually in an existing JDBC
river index. There is a generation number in the river but it is
unstable (it will be reset at a river restart). The "housekeeper" is
very naive and can be confused by manual doc insertions.

Jörg

Am 10.07.13 09:57, schrieb Ümit Seren:

I have a couple of jdbc rivers that run every hour (using the simple
strategy). They work fine.

Now I have the requirement that new records should be "instantly"
visible in the search and also in some lists that get populated by ES.
In oder to do this I manually add the new records to ES as soon as
they are persisted in the DB.
However after the next river run those documents get deleted by the
housekeeping functionality of the jdbc river because they have a lower
version number than the the version number of the jdbc river.

How does the jdbc river deal with new document ? Is the version of
newly added rows automatically set to the one that is stored in the
jdbc river index ?
If so when manually adding documents to the ES index should I use
version_type = external and set the version to the one of the jdbc river ?

cheers
Ümit

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Umit_Seren · July 10, 2013, 12:35pm

@Jörg thanks for the quick reply.
So to summarize if I understood it the workflow correctly when a JDBC river
runs:

1.) Fetches all rows (specified by the SQL statement in the river) from the
database.
2.) calculates a digest from (id,type and index) of all the fetched rows
(if new rows were added or rows were deleted this should change).
3.) for all rows re-index the documents. This will automatically increment
the version of each document.
4.) increment version of the river stored in the _river index (custom)
5.) if the calculated digest in #3 is different than the one that is stored
in the _river index then:

store it
run housekeeping function (deletes all docs with lower version
numbers).

So what I don't understand is what happens when a new row is added to the
index by the jdbc river ?
It should theoretically get the version id of 1.
But if this is the case then wouldn't it be automatically deleted by the
housekeeping if the version stored for the jdbc river run is higher than 1
?

How did you solve this? When you index new documents do you set the version
manually to the current jdbc river version ?

So last, should all documents in the index have the same version except
those that have been deleted in the SQL database ?

On Wed, Jul 10, 2013 at 2:14 PM, Jörg Prante joergprante@gmail.com wrote:

Thanks for your interest in JDBC river.

The JDBC river uses external versioning to solve a hard challenge, the
task of moving tabular data (SQL) to hierarchical documents (JSON). By
doing this by an SQL select statament with labeled column names (dot
notation) arbitrary JSON can be build out of rows. It is assumed a single
result row generates a single JSON doc.

In a river, there are three types of op types: create, index, delete.

So I have to deal with updates to existing JSON docs. Well, I overwrite
this docs (which is not optimal, it neglects partial updates).

Then, adding new docs. When new docs are added, there are two
possibilities of interpretation: a new whole index, or incremental adding.

Deleting data is most sensitive. Unfortunately, SQL deletes do not
propagate to the index from the RDBMS. To detect row deletions, I use
generations of SQL selects. Each river cycle is a generation. The
generation is the external version in ES. To ensure only one generation
exists, I have implemented a "housekeeper" that deletes all(!) docs that
have a lower version.

I'm aware this is not ideal for every situation. For example, it is
incompatible to incremental updates, which is expected intuitively by many
users.

So I prepared a very modular architecture (river source, river flow, river
mouth) so if you need another philosophy it should be quite easy to add
your customized strategy to the code by reusing much of the existing code.
The ES community is so smart and strong, I think there must be more JDBC to
ES strategies.

I have not thought about manipulating docs manually in an existing JDBC
river index. There is a generation number in the river but it is unstable
(it will be reset at a river restart). The "housekeeper" is very naive and
can be confused by manual doc insertions.

Jörg

Am 10.07.13 09:57, schrieb Ümit Seren:

I have a couple of jdbc rivers that run every hour (using the simple
strategy). They work fine.

Now I have the requirement that new records should be "instantly" visible
in the search and also in some lists that get populated by ES.
In oder to do this I manually add the new records to ES as soon as they
are persisted in the DB.
However after the next river run those documents get deleted by the
housekeeping functionality of the jdbc river because they have a lower
version number than the the version number of the jdbc river.

How does the jdbc river deal with new document ? Is the version of newly
added rows automatically set to the one that is stored in the jdbc river
index ?
If so when manually adding documents to the ES index should I use
version_type = external and set the version to the one of the jdbc river ?

cheers
Ümit

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@**googlegroups.com elasticsearch%2Bunsubscribe@googlegroups.com
.

For more options, visit https://groups.google.com/**groups/opt_out https://groups.google.com/groups/opt_out
.

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit https://groups.google.com/d/**
topic/elasticsearch/**oq48UN07wME/unsubscribehttps://groups.google.com/d/topic/elasticsearch/oq48UN07wME/unsubscribe
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@**googlegroups.com elasticsearch%2Bunsubscribe@googlegroups.com
.
For more options, visit https://groups.google.com/**groups/opt_out https://groups.google.com/groups/opt_out
.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · July 10, 2013, 12:48pm

Am 10.07.13 14:35, schrieb Ümit Seren:

When you index new documents do you set the version manually to the
current jdbc river version ?

Yes, I explicitly set the version in the new doc to the current
generation number, not to "1".

So last, should all documents in the index have the same version
except those that have been deleted in the SQL database ?
That's the idea of the simple strategy, yes.

There is another strategy (I called it "table" strategy). The idea is to
control the ES index from outside the river, at RDBMS side. There is no
housekeeper and no version fiddling, instead, a special table with
index/type/id must be prepared by a DBA, and this table is copied
straightforward, like a mirror. It is not well tested.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Umit_Seren · July 10, 2013, 1:07pm

Thanks! Now I start to get a better picture.
I know about the table strategy but the nice thing about the simple
strategy is that it's pull rather than push (with the table strategy I have
to actually put the rows into the jdbc river table).

One solution for my problem would be to initially do a oneshot strategy to
get all data into ES and from there on I could manually update the index.
However this way I have to make sure that everywhere in my business
logic/backend where I change data in the DB I also have to re-index it in
ES. That means more maintainance.

One last unrelated question:
I was not able to index a nested document.
I tried to do it the way it was described in the wiki about structured
documents (person.name, person.position).
However it always ends up being indexes as an object:

person : {
name: ['joe','martin'],
position: ['staff','admin']
}

instead of

person : [
{
'name':'joe',
'position':'staff'
},
{
'name':'martin',
'position':'admin'

]

Is there a solution for that ?
I read in the issue list that you added pseudo column _json. Can I use that
for indexing nested documents ?

On Wed, Jul 10, 2013 at 2:48 PM, Jörg Prante joergprante@gmail.com wrote:

Am 10.07.13 14:35, schrieb Ümit Seren:

When you index new documents do you set the version manually to the

current jdbc river version ?

Yes, I explicitly set the version in the new doc to the current generation
number, not to "1".

So last, should all documents in the index have the same version except
those that have been deleted in the SQL database ?

That's the idea of the simple strategy, yes.

There is another strategy (I called it "table" strategy). The idea is to
control the ES index from outside the river, at RDBMS side. There is no
housekeeper and no version fiddling, instead, a special table with
index/type/id must be prepared by a DBA, and this table is copied
straightforward, like a mirror. It is not well tested.

Jörg

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit https://groups.google.com/d/**
topic/elasticsearch/**oq48UN07wME/unsubscribehttps://groups.google.com/d/topic/elasticsearch/oq48UN07wME/unsubscribe
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@**googlegroups.com elasticsearch%2Bunsubscribe@googlegroups.com
.
For more options, visit https://groups.google.com/**groups/opt_out https://groups.google.com/groups/opt_out
.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · July 10, 2013, 2:03pm

Am 10.07.13 15:07, schrieb Ümit Seren:

Is there a solution for that ?
No solution yet. The missing feature is entity numbering (something like
'person.0.name', 'person.0.position', 'person.1.name',
'person.1.position'). Patches welcome

I read in the issue list that you added pseudo column _json. Can I use
that for indexing nested documents ?

You can store lexical JSON in an SQL column and this can be moved as
document to ES as is. Usually, this is not the popular way to store data
in columns.

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Umit_Seren · July 10, 2013, 2:12pm

Thanks for the headsup.
If I find a little bit time I might take a crack on the entity numbering.

Regarding the JSON.
Theoretically I could modify the SQL statement to return a JSON string as
one of the columns. Could I solve my nested problems this way or does the
JDBC river treat the one json column as entire document ?

On Wed, Jul 10, 2013 at 4:03 PM, Jörg Prante joergprante@gmail.com wrote:

Am 10.07.13 15:07, schrieb Ümit Seren:

Is there a solution for that ?

No solution yet. The missing feature is entity numbering (something like '
person.0.name', 'person.0.position', 'person.1.name',
'person.1.position'). Patches welcome

I read in the issue list that you added pseudo column _json. Can I use

that for indexing nested documents ?

You can store lexical JSON in an SQL column and this can be moved as
document to ES as is. Usually, this is not the popular way to store data in
columns.

Jörg

--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit https://groups.google.com/d/**
topic/elasticsearch/**oq48UN07wME/unsubscribehttps://groups.google.com/d/topic/elasticsearch/oq48UN07wME/unsubscribe
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@**googlegroups.com elasticsearch%2Bunsubscribe@googlegroups.com
.
For more options, visit https://groups.google.com/**groups/opt_out https://groups.google.com/groups/opt_out
.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · July 10, 2013, 4:37pm

Am 10.07.13 16:12, schrieb Ümit Seren:

Thanks for the headsup.
If I find a little bit time I might take a crack on the entity numbering.

Regarding the JSON.
Theoretically I could modify the SQL statement to return a JSON string
as one of the columns. Could I solve my nested problems this way or
does the JDBC river treat the one json column as entire document ?

The _json column is treated like the _source in the ES index (i.e. the
source() parameter in the Java API).

Jörg

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
How to update/delete indexed documents from ES index using mysql jdbc river Elasticsearch	4	2512	July 6, 2017
[Ann] Version 1.2.1 of JDBC river plugin, handling (most) of the JDBC types Elasticsearch	3	410	July 6, 2017
JDBC river update mechanism Elasticsearch	2	387	July 6, 2017
Jdbc river Elasticsearch	4	326	July 6, 2017
Best practice on getting data out of RDBMS(PostgreSQL)? Elasticsearch	2	443	July 6, 2017

Jdbc river, versioning and manual updates

Related topics