JDBC river update mechanism

Hi,

I am trying to figure out how to use the JDBC river
at https://github.com/jprante/elasticsearch-river-jdbc to create and
maintain an elasticsearch index.

If I understand it correctly, the "river table" managed mode of the river
requires me to create one row per object in the ES index, which includes
the full SQL statement to get the data for that object. This does not
really scale well (especially not for our 100M+ row table I would like to
index). I don't really see the reason to have to specify the SQL query for
every object, since this is very likely to be the exact same query for
all of them. Wouldn't it be easier to have that as part of the river
configuration, and assume that the same query is done for each object in
the index? i.e. something like:
{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "com.mysql.jdbc.Driver",
    "url" : "jdbc:mysql://db/schema",
    "user" : "elasticsearch",
    "password" : "elasticsearch",
    "rivertable" : "true",
    "sql" : "select id, name from products"
  },
  "index" : {
    "index" : "main",
    "type" : "product"
  }
}

Another approach to this problem could be to keep the same SQL table
description very close to how it is defined, but redefining the fields like
this:

source_sql: a query that would select multiple entries to
create/update/delete
_id: the field from the query above to use as the object id

Having it work like this, you would only need three entries in the river
table (one for each operation type), and you could keep a table with
timestamps to JOIN against in the SQL statements, allowing quite good
queries for incremental updates.
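To make the timestamp-JOIN idea concrete, here is a minimal sketch using an in-memory SQLite database in place of the real MySQL schema. All table and column names (`products`, `river_state`, `updated_at`, `last_run`) are made up for illustration; the point is the shape of the one-per-operation `source_sql` query:

```python
import sqlite3

# In-memory database standing in for the real MySQL schema;
# all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    CREATE TABLE river_state (op TEXT PRIMARY KEY, last_run TEXT);

    INSERT INTO products VALUES (1, 'old widget', '2012-10-01T00:00:00');
    INSERT INTO products VALUES (2, 'new widget', '2012-10-29T12:00:00');
    INSERT INTO river_state VALUES ('index', '2012-10-15T00:00:00');
""")

# One "source_sql" entry per operation type: the river would run this
# query and use the selected id column as the ES object _id.
source_sql = """
    SELECT p.id AS _id, p.name
    FROM products p
    JOIN river_state s ON s.op = 'index'
    WHERE p.updated_at > s.last_run
"""
rows = conn.execute(source_sql).fetchall()
print(rows)  # only the rows changed since the last river run
```

After each run, the river would bump `last_run` to the current timestamp, so the next run only picks up rows modified in between.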

(I actually misunderstood the instructions initially, thinking it worked
like this, but ended up being surprised I always got only one entry in my
elasticsearch index =))

What are your thoughts on these suggestions for extending the current river
mechanism?

-- Joakim

--

Hi Joakim,

thanks for trying out my JDBC river.

The "sql: statement" river mode, the default method, works like you
describe: it uses a single SQL statement defined at ES river setup. The
idea is that ES admins control the data flow from the DB to ES. The
disadvantage is that additional housekeeping of deleted or updated rows is
required on the ES side, which is a large overhead.
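To make that housekeeping overhead concrete: with a plain SELECT there is no delete signal, so on every run the set of IDs in ES has to be diffed against the set the statement returned. A sketch with plain Python sets (no ES involved; the ID values are illustrative):

```python
# IDs currently indexed in ES vs. IDs returned by the river's SQL statement.
# A SELECT cannot report deletions, so the river (or the ES admin) must
# diff the two sets on every run to find stale documents.
indexed_ids = {"1", "2", "3"}       # what ES currently holds
selected_ids = {"1", "3"}           # what the SQL statement returned

stale = indexed_ids - selected_ids  # documents to delete from ES
print(sorted(stale))                # ['2']
```

For a 100M+ row table, fetching and diffing the full ID sets on every run is exactly the overhead being described.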

The "rivertable: true" river mode was added as a second method to overcome
this. In this mode there is no ES housekeeping anymore, and no "sql"
statement is controlled on the ES river side. Instead, DBAs who do not have
access to ES (or do not want access to ES, for whatever reason) can control
object creation/deletion on the DB side, with the help of a single
additional table.
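For illustration, a minimal sketch of that one-row-per-object control table, again using in-memory SQLite. The column names below are my guesses for the purpose of the example, not the river's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical control ("river") table: one row per ES object, each row
# carrying the full SQL statement used to fetch that object's data.
conn.executescript("""
    CREATE TABLE jdbc_river (
        _index TEXT, _type TEXT, _id TEXT, source_sql TEXT
    );
    INSERT INTO jdbc_river VALUES
        ('main', 'product', '1', 'select id, name from products where id = 1'),
        ('main', 'product', '2', 'select id, name from products where id = 2');
""")

# The river would poll this table and execute each row's statement;
# note the per-row repetition Joakim points out for large tables.
for _index, _type, _id, sql in conn.execute("SELECT * FROM jdbc_river"):
    print(_index, _type, _id, "->", sql)
```

The per-row SQL is what makes the mode flexible, and also what makes it painful when every row would carry the same statement.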

Object creation from relational data is rather complex: given a single SQL
statement, the result has to be mapped to a list of structured JSON
objects. What you observed is my workaround: I allow a separate SQL
statement for each object ID, which gives the DBA a lot of flexibility in
constructing objects, using table joins etc. Admittedly, this can be
tedious for a single data table where you simply iterate over the rows and
use the row ID as the unique object ID, because the same river control
parameters also have to be repeated over and over again for each row.

Configuring an SQL statement pattern on the DB side would require a second
table, a control table beside the data table. I hesitate because it's not
easy to explain why a more complex setup is required; I try to keep things
simple.

In the beginning, the JDBC river's primary goal was just to be a quick
helper for getting existing data moved into ES for demonstration purposes.
But addressing users' requirements is becoming important, and I intend to
add more connector modes in future versions. One could be "table
transport", where tables are simply copied into ES as index types (with all
the problems of flat objects: they do not exploit the JSON structure).
Another could be "periodic transport", where new data is picked up
periodically, as configured on the DB side. I have received a pull request
on this.

So, your suggestions and pull requests are very welcome!

Cheers,
Jörg

On Monday, October 29, 2012 6:21:19 PM UTC+1, Joakim Lundborg wrote: