Jdbc river strategy

Erlendur_Hakonarson · November 25, 2014, 10:04am

Hi, I am new to ES, but my company is starting to use it.

When I set up a river, I scheduled it to check for data changes at a
30-minute interval. My largest index on dev contains 230k documents, but in
production it is expected to grow to 300 million docs.
Checking this 230k index for data changes puts a heavy load on the server:
it keeps the core at 100% for approx. 5 minutes.
It looks like it is reindexing the whole index every time. I am using the
simple strategy. Can someone show me where I can find documentation on the
different strategies?
Here is a sample of my river statement:

{
  "_index" : "_river",
  "_type" : "transactions_test",
  "_id" : "_meta",
  "_score" : 1,
  "_source" : {
    "type" : "jdbc",
    "jdbc" : {
      "strategy" : "simple",
      "url" : "db server connect string",
      "user" : "username",
      "schedule" : "0 20/30 * * * ?",
      "password" : "password",
      "index" : "transactions_test",
      "type" : "transaction_test",
      "sql" : "SELECT * from my_transaction_table"
    }
  }
}

best regards
Erlendur
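A side note on the schedule in the river statement above: assuming the river reads "schedule" as a Quartz-style cron expression (seconds, minutes, hours, day of month, month, day of week), the minute field "20/30" means "start at minute 20, then every 30 minutes". The sketch below expands only that one field; `expand_minutes` is a hypothetical helper written for illustration and is not part of the river.

```python
def expand_minutes(field: str) -> list[int]:
    """Expand a Quartz-style minute field such as '20/30' or '*' into minutes."""
    if field == "*":
        return list(range(60))
    if "/" in field:
        start, step = field.split("/")
        first = 0 if start == "*" else int(start)
        return list(range(first, 60, int(step)))
    return [int(field)]


# Fields of the schedule from the post, in Quartz order:
# second, minute, hour, day of month, month, day of week
fields = "0 20/30 * * * ?".split()
minutes = expand_minutes(fields[1])
print(minutes)  # [20, 50]
```

Under that assumption, the river runs at minutes 20 and 50 of every hour, which matches the 30-minute interval described in the post.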

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5c27c8d2-2f00-4e18-98f9-28f2af828d9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ramy · November 25, 2014, 10:55am

Hi Erlendur,
In your case, you should use the column strategy instead of the simple one.
The column strategy requires two columns in the SQL DB:

cerated_at
update_at
Cheers, Ramy

Ramy · November 25, 2014, 11:00am

Sorry...

created_at
updated_at

Erlendur_Hakonarson · November 25, 2014, 12:34pm

Thanks, Ramy.

But how does that strategy work?
Is there any documentation on the strategies that I can view?
The only one I found was on the jprante GitHub wiki, and that only describes
the simple strategy.

And what if I am using tables from a system that I have no control over, and
the columns created_at and updated_at are not in those tables?
Am I maybe misunderstanding this column strategy?

best regards
Erlendur

Ramy · November 25, 2014, 1:18pm

As far as I understand, you need both columns in your database table.
Otherwise the river won't be able to do the checks. The river compares the
updated_at date with its own date (the time of its last run). If that date
is equal to or later than the river's, the river tries to update/insert your
record in Elasticsearch.

Take a look at this code snippet:

{
  "strategy": "column",
  "type": "jdbc",
  "jdbc": {
    "url": "db server connect string",
    "user": "username",
    "schedule": "0 20/30 * * * ?",
    "password": "password",
    "index": "transactions_test",
    "type": "transaction_test",
    "sql": "SELECT * from my_transaction_table"
  }
}
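
To make the check described above concrete, here is a minimal sketch that simulates it against an in-memory SQLite table. It is not the river's code: the `id` and `amount` columns, the sample rows and the `fetch_changed_rows` helper are assumptions for illustration, and a single query stands in for whatever queries the river actually issues.

```python
import sqlite3


def fetch_changed_rows(conn, last_run):
    """Return only rows created or updated at or after the previous run."""
    cur = conn.execute(
        "SELECT id, amount FROM my_transaction_table "
        "WHERE created_at >= ? OR updated_at >= ? ORDER BY id",
        (last_run, last_run),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_transaction_table ("
    "id INTEGER PRIMARY KEY, amount REAL, created_at TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO my_transaction_table VALUES (?, ?, ?, ?)",
    [
        (1, 10.0, "2014-11-24 08:00:00", "2014-11-24 08:00:00"),  # unchanged
        (2, 20.0, "2014-11-24 09:00:00", "2014-11-25 10:05:00"),  # updated
        (3, 30.0, "2014-11-25 10:10:00", "2014-11-25 10:10:00"),  # new
    ],
)

# Hypothetical time of the previous river run (ISO strings sort chronologically).
last_run = "2014-11-25 10:00:00"
changed = fetch_changed_rows(conn, last_run)
print(changed)  # [(2, 20.0), (3, 30.0)]
```

Only rows 2 and 3 would be sent to Elasticsearch on this run. Row 1 is skipped, which is the difference from the simple strategy, where every row returned by the query is indexed again.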

Ramy · November 25, 2014, 1:40pm

Maybe this link will help you:

github.com/jprante/elasticsearch-jdbc
New strategy: "column" (Pull Request #137, jprante:master ← psliwa:master, opened by psliwa on 22 Nov 2013, +1524 -3)

I have written a new strategy for this river because the two existing strategies didn't meet my requirements. Additional options for this strategy are:

- columnCreatedAt (default: created_at)
- columnUpdatedAt (default: updated_at)
- columnDeletedAt (default: null)
- columnEscape (default: true)

I have provided tests for all DB engines that you are using in this project except Oracle DB. Here is some background information about this strategy:

## Updates with additional columns

This strategy depends on additional columns that you have to add to the table whose data you want to index:

- a column with the creation time
- a column with the last update time
- a column with the deletion time (optional); to support document deletion, you have to use the soft delete pattern

The strategy based on versioning is inefficient: the whole table is reindexed every time, even when only a few documents have been updated. The strategy based on a table, on the other hand, forces your application to populate an additional table, so you have to make changes in your application code. In a single river run, the `column` strategy indexes only those documents that should be indexed (new, updated or deleted). If a document hasn't changed since the previous river run, it will not even be processed by the river. The column strategy also doesn't force you to change your application code, except when the table that you want to index has no lifecycle timestamps (creation, last update or, optionally, deletion times). The main limitation is that document deletion is supported only when you use the soft delete pattern.

The river remembers the time of its last run, so the current run performs 3 queries: for new, updated and deleted documents. The proper WHERE clause conditions for each of these queries are created and appended to the `sql` option. When you already have a WHERE in your sql, the new conditions will be added to the beginning of the WHERE clause. Otherwise, a WHERE clause will be created at the end of the query.

Example:

```
sql: "SELECT * FROM products"
# it will be changed to
SELECT * FROM products WHERE "auto generated where"

sql: "SELECT * FROM products WHERE category=5"
# it will be changed to
SELECT * FROM products WHERE "auto generated where" AND category=5
```

Be careful when you are using subqueries with a WHERE clause, because the algorithm that changes your query is very simple and it can break your query in that case.

and this code snippet:
{
  "strategy": "column",
  "type": "jdbc",
  "jdbc": {
    "url": "db server connect string",
    "user": "username",
    "schedule": "0 20/30 * * * ?",
    "password": "password",
    "index": "transactions_test",
    "type": "transaction_test",
    "sql": "SELECT * from my_transaction_table"
  }
}
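
The WHERE-clause rewriting described in the pull request text can be sketched roughly as follows. This is an illustration of the described behaviour only, not the plugin's actual code: `add_condition` and the sample timestamp condition are hypothetical.

```python
import re


def add_condition(sql: str, condition: str) -> str:
    """Naively inject `condition` into `sql`.

    If the query already has a WHERE, the condition goes at the beginning of
    that clause; otherwise a WHERE clause is appended at the end of the query.
    """
    match = re.search(r"\bWHERE\b", sql, flags=re.IGNORECASE)
    if match is None:
        return f"{sql} WHERE {condition}"
    head, tail = sql[: match.end()], sql[match.end():]
    return f"{head} {condition} AND{tail}"


# Hypothetical condition of the kind generated for "updated" documents.
cond = "updated_at >= '2014-11-25 10:00:00'"

print(add_condition("SELECT * FROM products", cond))
print(add_condition("SELECT * FROM products WHERE category=5", cond))
```

Because this sketch simply looks for the first WHERE, a query such as SELECT * FROM (SELECT * FROM products WHERE category=5) p would get the condition injected into the subquery. That is the kind of breakage the pull request warns about when subqueries are involved.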

Erlendur_Hakonarson · November 26, 2014, 9:32am

Thanks again Ramy
This was very helpful
