Options for synchronizing SQL (Oracle) with Elasticsearch

Hello,

I have quite a lot of data in Oracle and I need to keep all of it
indexed in ES. I'm having trouble finding a good solution, so I hope you
guys can give me some feedback.

My use-case goes like this:

  • there are lots of rows (hundreds of millions) in quite a lot of tables
    (10-20)
  • the resulting documents are rather large, because they contain data from
    all those tables. What's more, they're structured and there are some
    arrays. For example, the document is a user, and the user can belong to
    multiple groups
  • so far the best fit seems to be Jörg's JDBC river[0], but I'm a bit
    worried about managing updates. The "simple" strategy seems to require
    looking at all the data from Oracle, and I would like to avoid that. And
    I'm a bit unclear about the "table" strategy - it seems like I would need
    to manage the timestamps and versioning from Oracle?

Ideally, what I'd want to have is something like the MongoDB river does
with the operations log. If that's not possible, what would you recommend
for this use-case?

[0] https://github.com/jprante/elasticsearch-river-jdbc

Thanks and best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Hi Radu,

Although it is a little sparsely documented, the JDBC river has had an
"acknowledge statement" mechanism since version 2.0.0.

The idea is that two SQL operations are executed per cycle:

  • one that selects rows from the RDBMS and constructs JSON objects
  • a second "acknowledge statement" that deletes or updates those rows

This "handshake" method must use a "job ID" column in the RDBMS data
(which is nothing but a version number in most cases)

The two statements are executed on two JDBC connections, a read and a
write connection, to ease the transaction management on the RDBMS side.

There are almost no restrictions on the "job ID" column (although some
DBs, not Oracle, may not easily move large result sets due to some
JDBC-driver-dependent oddities). The only task is to set up a suitable
time interval with the "poll" parameter in the river and a mechanism on
the DB side to generate the job IDs in sequence and to create the data to
be pushed.

The JDBC river relies on the database connectivity given by JDBC. You
are right about MongoDB's log-based approach; it is the most attractive
fit for the ES river philosophy. With Oracle Streams, it should be
possible to use JMS to connect to Oracle in a stream-like fashion. But
there is no JMS river yet:
https://github.com/elasticsearch/elasticsearch/issues/1109

Best regards,

Jörg


Hi Jörg,

Thanks a lot for replying! I'll ask some more inline :-)

On Thu, Feb 14, 2013 at 5:06 PM, Jörg Prante joergprante@gmail.com wrote:

Hi Radu,

Although it is a little sparsely documented, the JDBC river has had an
"acknowledge statement" mechanism since version 2.0.0.

The idea is that two SQL operations are executed per cycle:

  • one that selects rows from the RDBMS and constructs JSON objects
  • a second "acknowledge statement" that deletes or updates those rows

This "handshake" method must use a "job ID" column in the RDBMS data
(which is nothing but a version number in most cases)

The two statements are executed on two JDBC connections, a read and a
write connection, to ease the transaction management on the RDBMS side.

There are almost no restrictions on the "job ID" column (although some
DBs, not Oracle, may not easily move large result sets due to some
JDBC-driver-dependent oddities). The only task is to set up a suitable
time interval with the "poll" parameter in the river and a mechanism on
the DB side to generate the job IDs in sequence and to create the data to
be pushed.

I'm not sure I followed everything. I guess I'll have to add a new column
to every table, named jobID or something like that, which gets updated
with every operation. That part seems clear.

Then, how do I create the data that needs to be pushed? Do I do the joins
and put that data in a different table, so the JDBC river can pull it using
the "table" strategy?

If yes, then does the JDBC river automatically remove processed records
from that table? Or should I do that manually?

If none of the above should be done, I guess that table will pretty much
duplicate the data from the DB. Or am I missing something?

The JDBC river relies on the database connectivity given by JDBC. You are
right about MongoDB's log-based approach; it is the most attractive fit for
the ES river philosophy. With Oracle Streams, it should be possible to use
JMS to connect to Oracle in a stream-like fashion. But there is no JMS
river yet: https://github.com/elasticsearch/elasticsearch/issues/1109

Thanks for the pointer. At this point I see two "paths":

  • build ES documents from the DB every time, and try to filter by
    version/timestamp so I won't have to build all documents (rough sketch
    after this list). Then re-index them
  • keep track of all changes for every table. Then for every change use the
    Update API to apply each change to the corresponding ES document(s). In my
    use-case I can easily track the ES ID where each row should go, so that
    might work. The downside of this is having to apply updates one by one (no
    bulk)
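
For the first path, I imagine something like this (table and column names
are just for illustration):

    -- Rebuild only the documents whose source rows changed since the last
    -- run, instead of scanning everything.
    SELECT u.user_id, u.name, g.group_name
    FROM   users u
    JOIN   user_groups ug ON ug.user_id = u.user_id
    JOIN   groups g       ON g.group_id = ug.group_id
    WHERE  u.last_modified  > :last_run
       OR  ug.last_modified > :last_run
       OR  g.last_modified  > :last_run;

The catch is that every joined table needs its own change timestamp,
otherwise a change to a group alone wouldn't show up in the user documents.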

Is there another option?

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Hi Radu,

Yes, thinking in "ETL style" there are different tasks: baseline loading
(what you mention about the millions of rows) and incremental updates
(inserts/updates/deletes).

Baseline loading is straightforward (the "oneshot" JDBC river strategy).

For the incremental updates, there are options:

  • the DB "as is" must be examined for inserts/updates/deletes (like a
    "diff" operates on text lin files)
  • volatile data is created at DB side to get pushed 1:1 to ES
    (stream-like, the "table" strategy tries to do this, but is not well tested)

On the ES side, there are challenges:

  • stale data: data not in DB, but in ES
  • correct order of insert/update/delete (ES is not transactional)
  • synchronization (versioning), e.g. if the river was disabled for a
    while, how can it detect the missing data and how can it reset to the
    correct point in time

The JDBC river offers some rudimentary approaches:

  • it can use the versioning feature in ES for the river poll cycles
  • it can compute checksums over fetched data to detect differences
    between river poll cycles, and it has a housekeeper thread to delete
    documents with older versions

What is missing:

  • a strategy for rolling indices (timestamp-based polls could be
    directed to aliased indexes)

I think there is no one-size-fits-all solution, but I'd be happy about
suggestions that could be added to the JDBC river and that might be useful
for other rivers as well!

Best regards,

Jörg



Hi Jörg,

On Fri, Feb 15, 2013 at 1:22 PM, Jörg Prante joergprante@gmail.com wrote:

Hi Radu,

Yes, thinking in "ETL style" there are different tasks: baseline loading
(what you mention about the millions of rows) and incremental updates
(inserts/updates/deletes).

Right. That's my thinking as well.

Baseline loading is straightforward (the "oneshot" JDBC river strategy).

Right. I got that covered.

For the incremental updates, there are options:

  • the DB "as is" must be examined for inserts/updates/deletes (like a
    "diff" operates on text lin files)
  • volatile data is created at DB side to get pushed 1:1 to ES
    (stream-like, the "table" strategy tries to do this, but is not well tested)

OK. The first option won't work for me because of size, but I'll look some
more at the "table" strategy. But doesn't this strategy imply that all the
documents that should get into ES (hundreds of millions in my case) have to
be in that table? Or does the JDBC river evict processed items?

On the ES side, there are challenges:

  • stale data: data not in DB, but in ES
  • correct order of insert/update/delete (ES is not transactional)
  • synchronization (versioning), e.g. if the river was disabled for a
    while, how can it detect the missing data and how can it reset to the
    correct point in time

Right. I'm aware of that.

The JDBC river offers some rudimentary approaches:

  • it can use the versioning feature in ES for the river poll cycles
  • it can compute checksums over fetched data to detect differences between
    river poll cycles, and it has a housekeeper thread to delete documents
    with older versions

What is missing:

  • a strategy for rolling indices (timestamp-based polls could be directed
    to aliased indexes)

I think there is no one-size-fits-all solution, but I'd be happy about
suggestions that could be added to the JDBC river and that might be useful
for other rivers as well!

Right. That's the key takeaway for me. I'll come up with suggestions and/or
pull requests to your river if I get any good ideas that would apply.

Thanks a lot for your input!

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


No, you can first initiate a "oneshot" river. Afterwards, you can delete
this river (it runs only once) and use another river instance, maybe a
table-based one. It needs some testing to make sure the JSON objects match
in structure, but basically it should work.

Jörg



On Fri, Feb 15, 2013 at 2:25 PM, Jörg Prante joergprante@gmail.com wrote:

No, you can first initiate a "oneshot" river.

Got that. It would be the way to go for me.

Afterwards, you can delete this river (it runs only once) and use another
river instance, maybe a table-based one. It needs some testing to make sure
the JSON objects match in structure, but basically it should work.

Yes, but I'll have to put the "ES docs" in that table as they get updated
in Oracle. Given that the river doesn't delete processed documents (or at
least I missed the part in the code where it does that), that table will
get huge in quite a short time.

I was thinking about extending the river with a new parameter to remove
processed documents from the source table. But if I do that it's going to
be difficult to manage deletes (like you said, with stale data). Maybe a
way to tackle that is by having an "is_deleted" column so the river knows
to remove all docs where is_deleted=1.
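
Roughly, I imagine the handshake pair would look like this (made-up table
and column names):

    -- select: pick up both live and deleted rows for this job; the river
    -- would index rows with is_deleted = 0 and delete the ES documents for
    -- rows with is_deleted = 1
    SELECT user_id, doc, is_deleted FROM es_push WHERE job_id = ?;

    -- acknowledge: purge everything that was processed, tombstones included
    DELETE FROM es_push WHERE job_id = ?;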

Does this sound like an option to you?

Best regards,
Radu


--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene


Yes, that's what I'm currently trying to do with the acknowledging SQL
statement in the "simple" river strategy.

Example of such a handshake SQL statement pair (to sketch how it is
supposed to work):

  1. select * from table where job_id = ?
  2. delete from table where job_id = ?

or

  1. select * from table where job_id = ? and received = 'N'
  2. update table set received = 'Y' where job_id = ?

Jörg



Thanks a lot, Jörg!

I'll try the Update API approach first (see a couple of emails earlier),
but I'm not sure if it will be enough. So I'll probably come back to the
JDBC river soon :-)

Best regards,
Radu



--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
