Options for synchronizing SQL (Oracle) with Elasticsearch

Hello,

I have quite a lot of data in Oracle and I need to keep all of it
indexed in ES. I'm having trouble finding a good solution, so I hope you
guys can give me some feedback.

My use-case goes like this:

  • there are lots of rows (hundreds of millions) in quite a lot of tables
    (10-20)
  • the resulting documents are rather large, because they contain data from
    all those tables. What's more, they're structured and there are some
    arrays. For example, the document is a user, and the user can belong to
    multiple groups
  • so far the best fit seems to be Jörg's JDBC river[0], but I'm a bit
    worried about managing updates. The "simple" strategy seems to require
    looking at all the data from Oracle, and I would like to avoid that. And
    I'm a bit unclear about the "table" strategy - it seems like I would need
    to manage the timestamps and versioning from Oracle?

Ideally, what I'd want to have is something like the MongoDB river does
with the operations log. If that's not possible, what would you recommend
for this use-case?

[0] https://github.com/jprante/elasticsearch-river-jdbc

Thanks and best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Hi Radu,

Although it is a little sparsely documented, the JDBC river has had an
"acknowledge statement" mechanism since version 2.0.0.

The idea is that two SQL operations are executed per cycle:

  • one that selects rows from the RDBMS and constructs JSON objects
  • a second "acknowledge statement" that deletes or updates those rows

This "handshake" method must use a "job ID" column in the RDBMS data
(which is nothing but a version number in most cases)

The two statements are executed on two JDBC connections, a read and a
write connection, to ease the transaction management on the RDBMS side.

There are almost no restrictions on the "job ID" column (although some
DBs, not Oracle, may not easily move large result sets due to some
JDBC-driver-dependent oddities). The only task is to set up a suitable
time interval with the "poll" parameter in the river and a mechanism on
the DB side to generate the job IDs in sequence and to create the data to
be pushed.

The JDBC river relies on the database connectivity given by JDBC. You
are right about MongoDB's log-based approach; it is the most attractive
fit for the ES river philosophy. With Oracle Streams, it should be
possible to use JMS to connect to Oracle in a stream-like fashion. But
there is no JMS river yet:
https://github.com/elasticsearch/elasticsearch/issues/1109

Best regards,

Jörg


Hi Jörg,

Thanks a lot for replying! I'll ask some more inline :-)

On Thu, Feb 14, 2013 at 5:06 PM, Jörg Prante joergprante@gmail.com wrote:

Hi Radu,

Although it is a little sparsely documented, the JDBC river has had an
"acknowledge statement" mechanism since version 2.0.0.

The idea is that two SQL operations are executed per cycle:

  • one that selects rows from the RDBMS and constructs JSON objects
  • a second "acknowledge statement" that deletes or updates those rows

This "handshake" method must use a "job ID" column in the RDBMS data
(which is nothing but a version number in most cases)

The two statements are executed on two JDBC connections, a read and a
write connection, to ease the transaction management on the RDBMS side.

There are almost no restrictions on the "job ID" column (although some
DBs, not Oracle, may not easily move large result sets due to some
JDBC-driver-dependent oddities). The only task is to set up a suitable
time interval with the "poll" parameter in the river and a mechanism on
the DB side to generate the job IDs in sequence and to create the data to
be pushed.

I'm not sure I followed everything. I guess I'll have to add a new column
to every table, named jobID or something like that, which gets updated
with every operation. That part seems clear.

Then, how do I create the data that needs to be pushed? Do I do the joins
and put that data in a different table, so the JDBC river can pull it using
the "table" strategy?

If yes, then does the JDBC river automatically remove processed records
from that table? Or should I do that manually?

If none of the above should be done, I guess that table will pretty much
duplicate the data from the DB. Or am I missing something?

The JDBC river relies on the database connectivity given by JDBC. You are
right about MongoDB's log-based approach; it is the most attractive fit for
the ES river philosophy. With Oracle Streams, it should be possible to use
JMS to connect to Oracle in a stream-like fashion. But there is no JMS
river yet: https://github.com/elasticsearch/elasticsearch/issues/1109

Thanks for the pointer. At this point I see two "paths":

  • build ES documents from the DB every time, and try to filter by
    version/timestamp so I won't have to build all documents (rough sketch
    after this list). Then re-index them
  • keep track of all changes for every table. Then for every change use the
    Update API to apply each change to the corresponding ES document(s). In my
    use-case I can easily track the ES ID where each row should go, so that
    might work. The downside of this is having to apply updates one by one (no
    bulk)
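
For the first path, I imagine something like this (table and column names
are just for illustration):

    -- Rebuild only the documents whose source rows changed since the last
    -- run, instead of scanning everything.
    SELECT u.user_id, u.name, g.group_name
    FROM   users u
    JOIN   user_groups ug ON ug.user_id = u.user_id
    JOIN   groups g       ON g.group_id = ug.group_id
    WHERE  u.last_modified  > :last_run
       OR  ug.last_modified > :last_run
       OR  g.last_modified  > :last_run;

The catch is that every joined table needs its own change timestamp,
otherwise a change to a group alone wouldn't show up in the user documents.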

Is there another option?

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Hi Radu,

Yes, thinking in "ETL style" there are different tasks: baseline loading
(what you mention about the millions of rows) and incremental updates
(inserts/updates/deletes).

Baseline loading is straightforward (the "oneshot" JDBC river strategy).

For the incremental updates, there are options:

  • the DB "as is" must be examined for inserts/updates/deletes (like a
    "diff" operates on text lin files)
  • volatile data is created at DB side to get pushed 1:1 to ES
    (stream-like, the "table" strategy tries to do this, but is not well tested)

On the ES side, there are challenges:

  • stale data: data not in DB, but in ES
  • correct order of insert/update/delete (ES is not transactional)
  • synchronization (versioning), e.g. if the river was disabled for a
    while, how can it detect the missing data and how can it reset to the
    correct point in time

The JDBC river offers some rudimentary approaches:

  • it can use the versioning feature in ES for the river poll cycles
  • it can compute checksums over fetched data to detect differences
    between river poll cycles, and it has a housekeeper thread to delete
    documents with older versions

What is missing:

  • a strategy for rolling indices (timestamp-based polls could be
    directed to aliased indexes)

I think there is no one-size-fits-all solution, but I'd be happy about
suggestions that could be added to the JDBC river and that might be useful
for other rivers as well!

Best regards,

Jörg



Hi Jörg,

On Fri, Feb 15, 2013 at 1:22 PM, Jörg Prante joergprante@gmail.com wrote:

Hi Radu,

Yes, thinking in "ETL style" there are different tasks: baseline loading
(what you mention about the millions of rows) and incremental updates
(inserts/updates/deletes).

Right. That's my thinking as well.

Baseline loading is straightforward (the "oneshot" JDBC river strategy).

Right. I got that covered.

For the incremental updates, there are options:

  • the DB "as is" must be examined for inserts/updates/deletes (like a
    "diff" operates on text lin files)
  • volatile data is created at DB side to get pushed 1:1 to ES
    (stream-like, the "table" strategy tries to do this, but is not well tested)

OK. The first option won't work for me because of size, but I'll look some
more at the "table" strategy. But doesn't this strategy imply that all the
documents that should get into ES (hundreds of millions in my case) have to
be in that table? Or does the JDBC river evict processed items?

On the ES side, there are challenges:

  • stale data: data not in DB, but in ES
  • correct order of insert/update/delete (ES is not transactional)
  • synchronization (versioning), e.g. if the river was disabled for a
    while, how can it detect the missing data and how can it reset to the
    correct point in time

Right. I'm aware of that.

The JDBC river offers some rudimentary approaches:

  • it can use the versioning feature in ES for the river poll cycles
  • it can compute checksums over fetched data to detect differences between
    river poll cycles, and it has a housekeeper thread to delete documents
    with older versions

What is missing:

  • a strategy for rolling indices (timestamp-based polls could be directed
    to aliased indexes)

I think there is no one-size-fits-all solution, but I'd be happy about
suggestions that could be added to the JDBC river and that might be useful
for other rivers as well!

Right. That's the key takeaway for me. I'll come up with suggestions and/or
pull requests to your river if I get any good ideas that would apply.

Thanks a lot for your input!

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


No, you can first initiate a "oneshot" river. Afterwards, you can delete
this river (it runs only once) and use another river instance, maybe a
table-based one. It needs some testing to make sure the JSON objects match
in structure, but basically it should work.

Jörg



On Fri, Feb 15, 2013 at 2:25 PM, Jörg Prante joergprante@gmail.com wrote:

No, you can first initiate a "oneshot" river.

Got that. It would be the way to go for me.

Afterwards, you can delete this river (it runs only once) and use another
river instance, maybe a table-based one. It needs some testing to make sure
the JSON objects match in structure, but basically it should work.

Yes, but I'll have to put the "ES docs" in that table as they get updated
in Oracle. Given that the river doesn't delete processed documents (or at
least I missed the part in the code where it does that), that table will
get huge in quite a short time.

I was thinking about extending the river with a new parameter to remove
processed documents from the source table. But if I do that it's going to
be difficult to manage deletes (like you said, with stale data). Maybe a
way to tackle that is by having an "is_deleted" column so the river knows
to remove all docs where is_deleted=1.
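
Roughly, I imagine the handshake pair would look like this (made-up table
and column names):

    -- select: pick up both live and deleted rows for this job; the river
    -- would index rows with is_deleted = 0 and delete the ES documents for
    -- rows with is_deleted = 1
    SELECT user_id, doc, is_deleted FROM es_push WHERE job_id = ?;

    -- acknowledge: purge everything that was processed, tombstones included
    DELETE FROM es_push WHERE job_id = ?;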

Does this sound like an option to you?

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Yes, that's what I'm currently trying to do with the acknowledging SQL
statement in the "simple" river strategy.

Example of such a handshake SQL statement pair (to sketch how it is
supposed to work):

  1. select * from table where job_id = ?
  2. delete from table where job_id = ?

or

  1. select * from table where job_id = ? and received = 'N'
  2. update table set received = 'Y' where job_id = ?

Jörg



Thanks a lot, Jörg!

I'll try the Update API approach first (see a couple of emails earlier),
but I'm not sure if it will be enough. So I'll probably come back to the
JDBC river soon :-)

Best regards,
Radu



--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
