[ANN] JDBC river for Elasticsearch 2.0.0

Hey,

JDBC river plugin for Elasticsearch 2.0.0 is out.

Glimpsing the features:

  • for Elasticsearch 0.20+
  • more stable, tested against Derby, H2, HSQLDB, MySQL, PostgreSQL
  • example code and SQL in the testng tests
  • new conceptual model: "simple" strategy, "oneshot" strategy, "table"
    strategy
  • strategy consists of three modules for better reuse and extensibility
    (also in other rivers): RiverSource, RiverMouth, RiverFlow
  • new bulk indexing, showcase of org.elasticsearch.action.bulk.BulkProcessor
  • many bug fixes
  • binaries moved to bintray

https://bintray.com/pkg/show/general/jprante/elasticsearch-plugins/elasticsearch-river-jdbc

Documentation is being reorganized on the github wiki, more to come!

Cheers,

Jörg

--

Hi Jörg,

1)I' m using a Oracle XE server on a win7 x64, where I've done:
1.1)unzip elasticsearch-0.20.2.zip
1.2)copy ojdbc14.jar to %ES_HOME%/lib/ojdbc14.jar
1.3)./bin/plugin -url http://bit.ly/U75w1N -install river-jdbc

  1. Within oracle squirrel client:
    2.1)Create Table ORDERS(
    ORDER_ID Number(7) Primary Key,
    description VARCHAR2(100));

Create Table OPTIONS(
OPTION_ID Number(7) Primary Key,
order_id Number(7),
option_name VARCHAR2(50),
CONSTRAINT fk_order
FOREIGN KEY (order_id)
REFERENCES ORDERS(order_id))

INSERT INTO ORDERS(ORDER_ID, description) values(1, '1st order');

INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(1, 1, '1st
order option1');
INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(2, 1, '1st
order option2');

2.2)sqplus->

SQL> select * from orders;

ORDER_ID DESCRIPTION


     1 1st order

SQL> select * from options;

OPTION_ID ORDER_ID OPTION_NAME


     1          1 1st order option1
     2          1 1st order option2

SQL> select 'order' as "_index", ord.order_id as "_id",ord.order_id as
"order.oId",option_name as "order.options" from orders ord inner join
options opt on ord.order_id = opt.order_id;

_index _id order.oId order.options



order 1 1 1st order option1
order 1 1 1st order option2

  1. On a ubuntu VM box, hosted on the same win7 x64, I executed:
    ubuntu@ubuntu-VirtualBox:~$ uname -a
    Linux ubuntu-VirtualBox 3.2.0-24-generic-pae #37-Ubuntu SMP Wed Apr 25
    10:47:59 UTC 2012 i686 i686 i386 GNU/Linux

ubuntu@ubuntu-VirtualBox:~$ curl -XPUT
'192.168.56.1:9200/_river/my_jdbc_river/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"driver" : "oracle.jdbc.OracleDriver",
"url" : "jdbc:oracle:thin:@localhost:1521/XE",
"user" : "",
"password" : "",
"poll" : "10s",
"sql" : "select \u0027order\u0027 as "_index", ord.order_id as
"_id",ord.order_id as "order.oId",option_name as "order.option" from
orders ord inner join

options opt on ord.order_id = opt.order_id"
}
}'

4)I'm searching for:
ubuntu@ubuntu-VirtualBox:~$ curl -XGET
'192.168.56.1:9200/order/jdbc/_search?pretty'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "order",
"_type" : "jdbc",
"_id" : "1",
"_score" : 1.0, "_source" : {"order":{"oId":1,"option":"1st order
option2"},"_id":1,"_index":"order"}
} ]
}

Obviously is fetching only the 2nd record from the one-to-many relation,
the "1st order option1" is missing...where I've done wrong?

5)But I would like to have the following result:

"_score" : 1.0, "_source" : {"order":{"oId":1,"options":["1st order
option1","1st order option2"]},"_id":1,"_index":"order"}

that should be an equivalent result from your example:

index=relations id=Good
{"contact":{"employee":["Müller","Meier","Schulze"],"customer":"Good"}}

Can you please help me?with any hint?
Thanks
gabriel

--

If _id is the ID of your doc in Elasticsearch, your second record is erasing the first one as they have the same _id.

Le 25 janv. 2013 à 13:45, Gabriel Mirea gmirea@sqli.com a écrit :

Hi Jörg,

1)I' m using a Oracle XE server on a win7 x64, where I've done:
1.1)unzip elasticsearch-0.20.2.zip
1.2)copy ojdbc14.jar to %ES_HOME%/lib/ojdbc14.jar
1.3)./bin/plugin -url http://bit.ly/U75w1N -install river-jdbc

  1. Within oracle squirrel client:
    2.1)Create Table ORDERS(
    ORDER_ID Number(7) Primary Key,
    description VARCHAR2(100));

Create Table OPTIONS(
OPTION_ID Number(7) Primary Key,
order_id Number(7),
option_name VARCHAR2(50),
CONSTRAINT fk_order
FOREIGN KEY (order_id)
REFERENCES ORDERS(order_id))

INSERT INTO ORDERS(ORDER_ID, description) values(1, '1st order');

INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(1, 1, '1st order option1');
INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(2, 1, '1st order option2');

2.2)sqplus->

SQL> select * from orders;

ORDER_ID DESCRIPTION


     1		1st order

SQL> select * from options;

OPTION_ID ORDER_ID OPTION_NAME


     1          1 1st order option1
     2          1 1st order option2

SQL> select 'order' as "_index", ord.order_id as "_id",ord.order_id as "order.oId",option_name as "order.options" from orders ord inner join
options opt on ord.order_id = opt.order_id;

_index _id order.oId order.options


order 1 1 1st order option1
order 1 1 1st order option2

  1. On a ubuntu VM box, hosted on the same win7 x64, I executed:
    ubuntu@ubuntu-VirtualBox:~$ uname -a
    Linux ubuntu-VirtualBox 3.2.0-24-generic-pae #37-Ubuntu SMP Wed Apr 25 10:47:59 UTC 2012 i686 i686 i386 GNU/Linux

ubuntu@ubuntu-VirtualBox:~$ curl -XPUT '192.168.56.1:9200/_river/my_jdbc_river/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"driver" : "oracle.jdbc.OracleDriver",
"url" : "jdbc:oracle:thin:@localhost:1521/XE",
"user" : "",
"password" : "",
"poll" : "10s",
"sql" : "select \u0027order\u0027 as "_index", ord.order_id as "_id",ord.order_id as "order.oId",option_name as "order.option" from orders ord inner join

options opt on ord.order_id = opt.order_id"
}
}'

4)I'm searching for:
ubuntu@ubuntu-VirtualBox:~$ curl -XGET '192.168.56.1:9200/order/jdbc/_search?pretty'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "order",
"_type" : "jdbc",
"_id" : "1",
"_score" : 1.0, "_source" : {"order":{"oId":1,"option":"1st order option2"},"_id":1,"_index":"order"}
} ]
}

Obviously is fetching only the 2nd record from the one-to-many relation, the "1st order option1" is missing...where I've done wrong?

5)But I would like to have the following result:

"_score" : 1.0, "_source" : {"order":{"oId":1,"options":["1st order option1","1st order option2"]},"_id":1,"_index":"order"}

that should be an equivalent result from your example:

index=relations id=Good {"contact":{"employee":["Müller","Meier","Schulze"],"customer":"Good"}}

Can you please help me?with any hint?
Thanks
gabriel

--

--

Great thanks for the plugin.
Just a small question. Is there a way how to craete parent-child entries
with the JDBC River plugin? At least I wasn't able to come up with a
solution with the previous version (haven't really looked into the new
version tough).

On Friday, January 18, 2013 12:30:16 AM UTC+1, Jörg Prante wrote:

Hey,

JDBC river plugin for Elasticsearch 2.0.0 is out.

https://github.com/jprante/elasticsearch-river-jdbc

Glimpsing the features:

  • for Elasticsearch 0.20+
  • more stable, tested against Derby, H2, HSQLDB, MySQL, PostgreSQL
  • example code and SQL in the testng tests
  • new conceptual model: "simple" strategy, "oneshot" strategy, "table"
    strategy
  • strategy consists of three modules for better reuse and extensibility
    (also in other rivers): RiverSource, RiverMouth, RiverFlow
  • new bulk indexing, showcase of
    org.elasticsearch.action.bulk.BulkProcessor
  • many bug fixes
  • binaries moved to bintray

https://bintray.com/pkg/show/general/jprante/elasticsearch-plugins/elasticsearch-river-jdbc

Documentation is being reorganized on the github wiki, more to come!

Cheers,

Jörg

--

Yes, you can specify a special column label beginning with a prefix
underscore '_parent' for creating children documents.

The _parent column value corresponds to the bulk document parameter _parent

http://www.elasticsearch.org/guide/reference/api/bulk.html

I just pushed a fix for missing _timestamp handling in the JDBC river.

Best regards,

Jörg

Am 26.01.13 11:50, schrieb Ümit Seren:

Great thanks for the plugin.
Just a small question. Is there a way how to craete parent-child
entries with the JDBC River plugin? At least I wasn't able to come up
with a solution with the previous version (haven't really looked into
the new version tough).

--

hmm...conceptually, I understand that records erasing is happening due to
_id overwriting... can you please provide an hint/example how to avoid it?
many thanks,
gabriel

Le vendredi 25 janvier 2013 18:05:26 UTC+2, David Pilato a écrit :

If _id is the ID of your doc in Elasticsearch, your second record is
erasing the first one as they have the same _id.

Le 25 janv. 2013 à 13:45, Gabriel Mirea <gmi...@sqli.com <javascript:>> a
écrit :

Hi Jörg,

1)I' m using a Oracle XE server on a win7 x64, where I've done:
1.1)unzip elasticsearch-0.20.2.zip
1.2)copy ojdbc14.jar to %ES_HOME%/lib/ojdbc14.jar
1.3)./bin/plugin -url http://bit.ly/U75w1N -install river-jdbc

  1. Within oracle squirrel client:
    2.1)Create Table ORDERS(
    ORDER_ID Number(7) Primary Key,
    description VARCHAR2(100));

Create Table OPTIONS(
OPTION_ID Number(7) Primary Key,
order_id Number(7),
option_name VARCHAR2(50),
CONSTRAINT fk_order
FOREIGN KEY (order_id)
REFERENCES ORDERS(order_id))

INSERT INTO ORDERS(ORDER_ID, description) values(1, '1st order');

INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(1, 1, '1st
order option1');
INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(2, 1, '1st
order option2');

2.2)sqplus->

SQL> select * from orders;

ORDER_ID DESCRIPTION


     1                1st order 

SQL> select * from options;

OPTION_ID ORDER_ID OPTION_NAME


     1          1 1st order option1 
     2          1 1st order option2 

SQL> select 'order' as "_index", ord.order_id as "_id",ord.order_id as
"order.oId",option_name as "order.options" from orders ord inner join
options opt on ord.order_id = opt.order_id;

_index _id order.oId order.options



order 1 1 1st order option1
order 1 1 1st order option2

  1. On a ubuntu VM box, hosted on the same win7 x64, I executed:
    ubuntu@ubuntu-VirtualBox:~$ uname -a
    Linux ubuntu-VirtualBox 3.2.0-24-generic-pae #37-Ubuntu SMP Wed Apr 25
    10:47:59 UTC 2012 i686 i686 i386 GNU/Linux

ubuntu@ubuntu-VirtualBox:~$ curl -XPUT '
192.168.56.1:9200/_river/my_jdbc_river/_meta' -d '{
"type" : "jdbc",
"jdbc" : {
"driver" : "oracle.jdbc.OracleDriver",
"url" : "jdbc:oracle:thin:@localhost:1521/XE",
"user" : "",
"password" : "",
"poll" : "10s",
"sql" : "select \u0027order\u0027 as "_index", ord.order_id as
"_id",ord.order_id as "order.oId",option_name as "order.option" from
orders ord inner join

options opt on ord.order_id = opt.order_id"
}
}'

4)I'm searching for:
ubuntu@ubuntu-VirtualBox:~$ curl -XGET '
192.168.56.1:9200/order/jdbc/_search?pretty'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "order",
"_type" : "jdbc",
"_id" : "1",
"_score" : 1.0, "_source" : {"order":{"oId":1,"option":"1st order
option2"},"_id":1,"_index":"order"}
} ]
}

Obviously is fetching only the 2nd record from the one-to-many relation,
the "1st order option1" is missing...where I've done wrong?

5)But I would like to have the following result:

"_score" : 1.0, "_source" : {"order":{"oId":1,"options":["1st order
option1","1st order option2"]},"_id":1,"_index":"order"}

that should be an equivalent result from your example:

index=relations id=Good
{"contact":{"employee":["Müller","Meier","Schulze"],"customer":"Good"}}

Can you please help me?with any hint?
Thanks
gabriel

--

--

Hi Gabriel,

Thanks for your example. It should work. I have to check what is going
wrong and opened an issue for it.

Best reagrds,

Jörg

Am 27.01.13 22:53, schrieb Gabriel Mirea:

hmm...conceptually, I understand that records erasing is happening
due to _id overwriting... can you please provide an hint/example how
to avoid it?
many thanks,
gabriel

Le vendredi 25 janvier 2013 18:05:26 UTC+2, David Pilato a écrit :

If _id is the ID of your doc in Elasticsearch, your second record
is erasing the first one as they have the same _id.


Le 25 janv. 2013 Ã  13:45, Gabriel Mirea <gmi...@sqli.com
<javascript:>> a écrit :

> Hi Jörg,
>
> 1)I' m using a Oracle XE server on a win7 x64, where I've done:
> 1.1)unzip elasticsearch-0.20.2.zip
> 1.2)copy ojdbc14.jar to %ES_HOME%/lib/ojdbc14.jar
> 1.3)./bin/plugin -url http://bit.ly/U75w1N -install river-jdbc
>
> 2) Within oracle squirrel client:
> 2.1)Create Table ORDERS(
> ORDER_ID Number(7) Primary Key,
> description VARCHAR2(100));
>
> Create Table OPTIONS(
> OPTION_ID Number(7) Primary Key,
> order_id Number(7),
> option_name VARCHAR2(50),
> CONSTRAINT fk_order
>     FOREIGN KEY (order_id)
>     REFERENCES ORDERS(order_id))
>
> INSERT INTO ORDERS(ORDER_ID, description) values(1, '1st order');
>
> INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(1,
1, '1st order option1');
> INSERT INTO OPTIONS(OPTION_ID, ORDER_ID, option_name) values(2,
1, '1st order option2');
>
> 2.2)sqplus->
>
>
> SQL> select * from orders;
>
>   ORDER_ID   DESCRIPTION
> ----------  
--------------------------------------------------------------------------------

>          1                1st order
>
> SQL> select * from options;
>
>  OPTION_ID   ORDER_ID OPTION_NAME
> ---------- ----------
--------------------------------------------------
>          1          1 1st order option1
>          2          1 1st order option2
>
> SQL> select 'order' as "_index", ord.order_id as
"_id",ord.order_id as "order.oId",option_name as "order.options"
from orders ord inner join
> options opt on ord.order_id = opt.order_id;
>
> _index        _id  order.oId order.options
> ------ ---------- ----------
--------------------------------------------------
> order           1          1 1st order option1
> order           1          1 1st order option2
>
>
> 3) On a ubuntu VM box, hosted on the same win7 x64, I executed:
> ubuntu@ubuntu-VirtualBox:~$ uname -a
> Linux ubuntu-VirtualBox 3.2.0-24-generic-pae #37-Ubuntu SMP Wed
Apr 25 10:47:59 UTC 2012 i686 i686 i386 GNU/Linux
>
> ubuntu@ubuntu-VirtualBox:~$ curl -XPUT
'192.168.56.1:9200/_river/my_jdbc_river/_meta
<http://192.168.56.1:9200/_river/my_jdbc_river/_meta>' -d '{
>     "type" : "jdbc",
>     "jdbc" : {
>         "driver" : "oracle.jdbc.OracleDriver",
>         "url" : "jdbc:oracle:thin:@localhost:1521/XE",
>         "user" : "",
>         "password" : "",
>         "poll" : "10s",
>         "sql" : "select \u0027order\u0027 as \"_index\",
ord.order_id as \"_id\",ord.order_id as \"order.oId\",option_name
as \"order.option\" from orders ord inner join
>
> options opt on ord.order_id = opt.order_id"
>     }
> }'
>
> 4)I'm searching for:
> ubuntu@ubuntu-VirtualBox:~$ curl -XGET
'192.168.56.1:9200/order/jdbc/_search?pretty
<http://192.168.56.1:9200/order/jdbc/_search?pretty>'
> {
>   "took" : 2,
>   "timed_out" : false,
>   "_shards" : {
>     "total" : 5,
>     "successful" : 5,
>     "failed" : 0
>   },
>   "hits" : {
>     "total" : 1,
>     "max_score" : 1.0,
>     "hits" : [ {
>       "_index" : "order",
>       "_type" : "jdbc",
>       "_id" : "1",
>       "_score" : 1.0, "_source" :
{"order":{"oId":1,"option":"1st order
option2"},"_id":1,"_index":"order"}
>     } ]
>   }
>
> Obviously is fetching only the 2nd record from the one-to-many
relation, the "1st order option1" is missing...where I've done wrong?
>
> 5)But I would like to have the following result:
>
> "_score" : 1.0, "_source" : {"order":{"oId":1,"options":["1st
order option1","1st order option2"]},"_id":1,"_index":"order"}
>
> that should be an equivalent result from your example:
>
>
> index=relations id=Good
{"contact":{"employee":["Müller","Meier","Schulze"],"customer":"Good"}}

>
> Can you please help me?with any hint?
> Thanks
> gabriel
>
> --
>
>

--

Hi everybody,

I'm a newbie in elasticsearch and I'm trying to make a mapping between some
mysql tables and elasticsearch. For this purpuse I use river-jdbc but I am
having some issues to represent (foreign keys).
After some search over the internet I found that there's an option for
elasticsearch to represent a parent-child link which could be the best
aproach to define a "foreign key" in elasticsearch.

The problem that I am facing is that even if I have read some explanations
of how to use the _parent prefix I don't know how to implement it in a
river-jdbc query in order to preserve mysql's structure represented in a
parent-child structure in elasticsearch.

The below JSON representation is what I intend to do using a river-jdbc
query.

document = {
"docid": 123456789,
"text": "This is an ingredient description",
"category": [{
"catid": '124122151',
"name": "avocato",
"product": [
{"prodid": "12452361"},
{"name": "Bio avacato"}
]
},
{
"catid": '124122153',
"name": "carrot",
"product": [
{"prodid": "12452362"},
{"name": "Bio carrot"}
]
}],
}

On Saturday, January 26, 2013 1:02:36 PM UTC+1, Jörg Prante wrote:

Yes, you can specify a special column label beginning with a prefix
underscore '_parent' for creating children documents.

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns

The _parent column value corresponds to the bulk document parameter
_parent

http://www.elasticsearch.org/guide/reference/api/bulk.html

I just pushed a fix for missing _timestamp handling in the JDBC river.

Best regards,

Jörg

Am 26.01.13 11:50, schrieb Ümit Seren:

Great thanks for the plugin.
Just a small question. Is there a way how to craete parent-child
entries with the JDBC River plugin? At least I wasn't able to come up
with a solution with the previous version (haven't really looked into
the new version tough).

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

The idea is to run the river twice, once for the parents, and once for
the children docs.

It does not matter if parents exist, you can add children nevertheless.
So you can run the children river first and parent second or vice versa.

I think you are aware of it, parent/child in ES is not really a "foreign
key" surrogate. It helps to enhance performance because related
documents are stored nearby and there is a strict 1:n relationship. But,
unlike foreign keys in the relational DB context, there is no contraint
management and no rejection of deletes if your data in the index gets
inconstent.

Does this help?

Jörg

Am 06.03.13 16:38, schrieb Carlos Alberto López Servín:

Hi everybody,

I'm a newbie in elasticsearch and I'm trying to make a mapping between
some mysql tables and elasticsearch. For this purpuse I use
river-jdbc but I am having some issues to represent (foreign keys).
After some search over the internet I found that there's an option for
elasticsearch to represent a parent-child link which could be the best
aproach to define a "foreign key" in elasticsearch.

The problem that I am facing is that even if I have read some
explanations of how to use the _parent prefix I don't know how to
implement it in a river-jdbc query in order to preserve mysql's
structure represented in a parent-child structure in elasticsearch.

The below JSON representation is what I intend to do using a
river-jdbc query.

document = {
"docid": 123456789,
"text": "This is an ingredient description",
"category": [{
"catid": '124122151',
"name": "avocato",
"product": [
{"prodid": "12452361"},
{"name": "Bio avacato"}
]
},
{
"catid": '124122153',
"name": "carrot",
"product": [
{"prodid": "12452362"},
{"name": "Bio carrot"}
]
}],
}

On Saturday, January 26, 2013 1:02:36 PM UTC+1, Jörg Prante wrote:

Yes, you can specify a special column label beginning with a prefix
underscore '_parent' for creating children documents.

https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns
<https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns>


The _parent column value corresponds to the bulk document
parameter _parent

http://www.elasticsearch.org/guide/reference/api/bulk.html
<http://www.elasticsearch.org/guide/reference/api/bulk.html>

I just pushed a fix for missing _timestamp handling in the JDBC
river.

Best regards,

Jörg

Am 26.01.13 11:50, schrieb Ümit Seren:
> Great thanks for the plugin.
> Just a small question. Is there a way how to craete parent-child
> entries with the JDBC River plugin? At least I wasn't able to
come up
> with a solution with the previous version (haven't really looked
into
> the new version tough).
>
>

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.