jdbcRiver rebuilding after restart


(Garrett Johnson) #1

I have a river created using the column strategy.

Whenever the system reboots or restarts it appears that the river is
forgetting it's last place and starts all over again.

These systems will do updates from time to time and rebuilding the indexes
is taking a really long time slowing down performance of elastic search.

Here's an example of one of the built river documents.

PUT _river/mytypename/_meta
{
"type":"jdbc",
"jdbc": {

"url":"jdbc:sqlserver://servername.mydomain.com:1433;databaseName=mydbname;integratedSecurity=true;",
"sql":"SELECT id as _id, * FROM [mydbname].[dbo].[mytypename]",
"strategy":"column",
"created_at":"createddate",
"updated_at":"lastmodifieddate",
"schedule":"0 0-55 0-23 ? * *",
"autocommit":true,
"index":"jdbc",
"type":"mytypename"
}
}

Any ideas on how to handle this? It takes about 6 hours for it to get all
the data and during that time ElasticSearch becomes very slow.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/23224844-e417-4da2-824b-e38c8299e13f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

It is up to the SQL statement to control the rows that are fetched when the
JDBC river restarts.

Note that rivers are deprecated. One of the reason because rivers are
obsoleted is the undefined state if a node restarts. JDBC river simply
re-runs the SQL statement.

Use the JDBC plugin in standalone feeder mode for better control what data
is indexed, e.g. one initial run to index all data, and a cron job for
incremental data.

Jörg

On Sun, Apr 19, 2015 at 6:47 PM, GWired garrettcjohnson@gmail.com wrote:

I have a river created using the column strategy.

Whenever the system reboots or restarts it appears that the river is
forgetting it's last place and starts all over again.

These systems will do updates from time to time and rebuilding the indexes
is taking a really long time slowing down performance of elastic search.

Here's an example of one of the built river documents.

PUT _river/mytypename/_meta
{
"type":"jdbc",
"jdbc": {
"url":"jdbc:sqlserver://servername.mydomain.com:1433
;databaseName=mydbname;integratedSecurity=true;",
"sql":"SELECT id as _id, * FROM [mydbname].[dbo].[mytypename]",
"strategy":"column",
"created_at":"createddate",
"updated_at":"lastmodifieddate",
"schedule":"0 0-55 0-23 ? * *",
"autocommit":true,
"index":"jdbc",
"type":"mytypename"
}
}

Any ideas on how to handle this? It takes about 6 hours for it to get all
the data and during that time ElasticSearch becomes very slow.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/23224844-e417-4da2-824b-e38c8299e13f%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/23224844-e417-4da2-824b-e38c8299e13f%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoHAEfnVvexevz3is_xRqFOPFJx%2BXLMzF4wp-JqPjDajhg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Garrett Johnson) #3

I can't look at the feeder setup now but I could in the future.

Is my SQL statement incorrect?

Should I be doing something differently?

Does the river not utilize created_at and updated_at in this setup? I
don't have a where clause because I thought using the column strategy it
would take that in to account.

This is an example of what I see in SQL server:

SELECT id as _id, * FROM [MyDBName].[dbo].[MyTableName] WHERE ({fn
TIMESTAMPDIFF(SQL_TSI_SECOND,@P0,"createddate")} >= 0)

Which when i populate the @P0 with a timestamp it seems to be working fine.

On a restart I'm guessing it doesn't know when to start.

Any way that I can check values in elasticsearch within the column
strategy? Such as using Max(CreatedDate) so that it can start there?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #4

The column strategy is a community effort, it can manipulate SQL statement
where clauses with timestamp filter.

I do not have enough knowledge about column strategy.

You are correct, at node restart, a river does not know from where to
restart. There is no method to resolve this within river logic.

Jörg

On Mon, Apr 20, 2015 at 2:11 PM, GWired garrettcjohnson@gmail.com wrote:

I can't look at the feeder setup now but I could in the future.

Is my SQL statement incorrect?

Should I be doing something differently?

Does the river not utilize created_at and updated_at in this setup? I
don't have a where clause because I thought using the column strategy it
would take that in to account.

This is an example of what I see in SQL server:

SELECT id as _id, * FROM [MyDBName].[dbo].[MyTableName] WHERE ({fn
TIMESTAMPDIFF(SQL_TSI_SECOND,@P0,"createddate")} >= 0)

Which when i populate the @P0 with a timestamp it seems to be working fine.

On a restart I'm guessing it doesn't know when to start.

Any way that I can check values in elasticsearch within the column
strategy? Such as using Max(CreatedDate) so that it can start there?

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFOXtchcMm%3D3PN2gwA6P%2B%2BZoDNtSpwRAk51e2yNXuAdNQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Garrett Johnson) #5

I fixed this by using Stored Procedures. I store the last run dates in my
SQL database and group them by month so as not to overload my under powered
VM's.

So far so good this approach works much better than the column strategy.

On Monday, April 20, 2015 at 10:59:31 AM UTC-4, Jörg Prante wrote:

The column strategy is a community effort, it can manipulate SQL statement
where clauses with timestamp filter.

I do not have enough knowledge about column strategy.

You are correct, at node restart, a river does not know from where to
restart. There is no method to resolve this within river logic.

Jörg

On Mon, Apr 20, 2015 at 2:11 PM, GWired <garrett...@gmail.com
<javascript:>> wrote:

I can't look at the feeder setup now but I could in the future.

Is my SQL statement incorrect?

Should I be doing something differently?

Does the river not utilize created_at and updated_at in this setup? I
don't have a where clause because I thought using the column strategy it
would take that in to account.

This is an example of what I see in SQL server:

SELECT id as _id, * FROM [MyDBName].[dbo].[MyTableName] WHERE ({fn
TIMESTAMPDIFF(SQL_TSI_SECOND,@P0,"createddate")} >= 0)

Which when i populate the @P0 with a timestamp it seems to be working
fine.

On a restart I'm guessing it doesn't know when to start.

Any way that I can check values in elasticsearch within the column
strategy? Such as using Max(CreatedDate) so that it can start there?

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/06e9ce54-8b71-4337-971b-440a5b56f00d%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8ee4db61-d7af-459f-9152-8c808d8f3f9e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(system) #6