It works, but not as I expected. On the first run, the file D:\Elastic\Logstash717\logs.logstash_jdbc_last_run from last_run_metadata_path is initialized with 1970-01-01T00:00:00.000, but then, after only a few SQL executions, it quickly reaches the current date:
[2023-10-12T05:09:22,032][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.060240s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '1970-01-01T00:00:00.000' ORDER BY LogDate ASC
[2023-10-12T05:09:31,273][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.007982s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-11T08:00:04.093' ORDER BY LogDate ASC
[2023-10-12T05:09:40,225][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.008682s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-11T15:00:06.327' ORDER BY LogDate ASC
[2023-10-12T05:09:50,303][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.009608s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-11T22:00:12.550' ORDER BY LogDate ASC
[2023-10-12T05:10:00,341][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.004309s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-12T05:00:14.237' ORDER BY LogDate ASC
[2023-10-12T05:10:10,397][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.003175s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-12T12:00:15.930' ORDER BY LogDate ASC
[2023-10-12T05:10:20,517][INFO ][logstash.inputs.jdbc ][gos][d445a8b13f63b88db9814df5173b9b99d1b89b68c865011e671f3fb2ebeaabc0] (0.001379s) SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > '2023-10-12T12:00:15.930' ORDER BY LogDate ASC
Why is that? How is the :sql_last_value being incremented?
I was expecting it to be incremented once per scheduled run (every 10s), so that all the data from 1970-01-01 until today would eventually be fetched.
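For reference, here is a minimal sketch of the kind of jdbc input this question is about, reconstructed from the log lines above; the connection details, credentials, driver settings, and file path are placeholders, not the real config:

input {
  jdbc {
    # placeholders, not the real connection details
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=MyDb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    # run every 10 seconds (rufus-scheduler cron syntax with a seconds field)
    schedule => "*/10 * * * * *"
    # the statement seen in the logs above
    statement => "SELECT TOP(100) * from DBTABLE with(nolock) where LogDate > :sql_last_value ORDER BY LogDate ASC"
    use_column_value => true
    tracking_column => "logdate"
    tracking_column_type => "timestamp"
    # placeholder path; the question keeps this file under D:\Elastic\Logstash717
    last_run_metadata_path => "D:/Elastic/Logstash717/.logstash_jdbc_last_run"
  }
}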
As far as I know, the last run value starts at 1970, and when you run Logstash it initially processes everything and saves the date from the last record's tracking column as the last run value. In the next run, if you have new records, it reads everything coming after this last run value; if there are no new records, it reads nothing. This prevents Logstash from reading the same rows multiple times.
When using a timestamp tracking column, the input extracts the tracking column from each row of the result set and persists the last value it sees to last_run_metadata_path.
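The persisted state is just that single value, serialized to the file as YAML, so after the runs above the file would contain something roughly like this (exact formatting can differ between plugin versions):

--- 2023-10-12 12:00:15.930000000 Z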
Are you trying to do paging to fetch the entire table, but only 100 rows every 10 seconds?
Yes, I am trying to fetch the entire table, but in batches of 100 or 1000 rows every 10s. Is there a proper way to do that? Should I drop this manual paging and remove the TOP(100)?
If I query the DB using another tool, I get more than 100000 rows, but with Logstash and the above config, only tens of rows get ingested.
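One way to let the plugin do the batching itself is to enable its built-in paging and drop the TOP(100) from the statement. A sketch of the relevant options (assuming jdbc_paging_enabled and jdbc_page_size; the rest of the jdbc input stays as before):

    # let the plugin page through the full result set instead of TOP(100)
    jdbc_paging_enabled => true
    jdbc_page_size => 1000
    statement => "SELECT * from DBTABLE with(nolock) where LogDate > :sql_last_value ORDER BY LogDate ASC"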
Exception when executing JDBC query {:exception=>Sequel::DatabaseError, :message=>"Java::ComMicrosoftSqlserverJdbc::SQLServerException: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified."
Basically, I am using a numeric tracking column (the Unix timestamp in seconds) instead of a timestamp one. I also make sure LogDate is between :sql_last_value and CURRENT_TIMESTAMP.
Also, the ORDER BY needs OFFSET 0 ROWS after it, otherwise SQL Server rejects it inside the paging subquery (the exception above).
With paging enabled, the select is executed in batches of jdbc_page_size (10000) rows until all the table data has been fetched.
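Putting it together, a sketch of an input along those lines; the logdate_unix column name, the DATEDIFF/DATEADD derivation of the Unix seconds, and the connection details are illustrative assumptions, not necessarily the exact statement used:

input {
  jdbc {
    # placeholder connection details
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=MyDb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    schedule => "*/10 * * * * *"
    jdbc_paging_enabled => true
    jdbc_page_size => 10000
    use_column_value => true
    tracking_column => "logdate_unix"
    tracking_column_type => "numeric"
    last_run_metadata_path => "D:/Elastic/Logstash717/.logstash_jdbc_last_run"
    # expose the Unix timestamp as a numeric column, keep LogDate between
    # :sql_last_value and CURRENT_TIMESTAMP, and add OFFSET 0 ROWS so the
    # ORDER BY is accepted inside the paging subquery
    statement => "SELECT *, DATEDIFF(SECOND, '1970-01-01', LogDate) AS logdate_unix FROM DBTABLE WITH (NOLOCK) WHERE LogDate > DATEADD(SECOND, :sql_last_value, '1970-01-01') AND LogDate <= CURRENT_TIMESTAMP ORDER BY LogDate ASC OFFSET 0 ROWS"
  }
}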
Thank you for all your help! Much appreciated!
Catalin