Logstash delta pull from Postgres

I have a simple Logstash pipeline that runs a SQL query to pull data from a Postgres DB into ES.

The initial pull was 1.5 million rows; it ran last night and took about 5 minutes. But the daily delta pull is only about 4,000 rows.

I expected the delta pull to take just a few seconds... however, today's run also took about 5 minutes.

I am wondering why.
I hope it is only pulling the delta data... and if so, why is it taking almost as long as pulling all the data?

Thanks

Maybe it is pulling all of the data again. What does the configuration look like?

I have pasted the sample from the 7.8 docs.

In my case, I have to filter for delta data based on multiple columns; this example shows just one.

How do I go about setting the WHERE clause and the tracking_column?

input {
  jdbc {
    statement => "SELECT id, mycolumn1, mycolumn2 FROM my_table WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"
    # ... other configuration bits
  }
}
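Roughly, what I am aiming for is something like this sketch (the extra predicate and the column names here are just illustrative, since I have not gotten my real query working yet):

input {
  jdbc {
    # Sketch only: delta filter on a timestamp column plus one extra predicate.
    # :sql_last_value holds the highest tracked value seen on the previous run.
    statement => "SELECT id, mycolumn1, mycolumn2 FROM my_table
                  WHERE updated_at > :sql_last_value AND status = 'ACTIVE'
                  ORDER BY updated_at"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
    # ... other configuration bits
  }
}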

I tried this approach, but I do not think it worked well.

My pipeline configuration...
Note that I have commented out the --'2019-04-15' portion after the '?'.

statement=>"select initcap(b.company_name) company_name, a.symbol,a.day_close,a.day_high,a.day_low,
a.trade_date,a.rs,a.high_52week,a.low_52week
from stock_daily_data a,company_listing b where a.trade_date> '?' --'2019-04-15'
and a.symbol=b.symbol"
use_column_value => true
tracking_column => "a.trade_date"
prepared_statement_bind_values => [":sql_last_value"]
prepared_statement_name => "sbsectorsymbolonee"
use_prepared_statements => true
tracking_column_type => "timestamp"
last_run_metadata_path => "/dockermnt/logstash/lastvalues/xx.yml"

Logstash log snippet:

2020-06-25T23:11:59.023551191Z [2020-06-25T23:11:59,021][INFO ][logstash.inputs.jdbc ][sector_symbol_one][c564e52a52c7396ffa0533a145f2e4d228e32c50e285fcef8a7f2c04cc23171c] (0.056661s) PREPARE xx: select initcap(b.company_name) company_name, a.symbol,a.day_close,a.day_high,a.day_low,
2020-06-25T23:11:59.023648903Z a.trade_date,a.rs,a.high_52week,a.low_52week
2020-06-25T23:11:59.023760102Z from stock_daily_data a,company_listing b where a.trade_date> '?' --'2019-04-15'
2020-06-25T23:11:59.023774379Z and a.symbol=b.symbol
2020-06-25T23:11:59.204127319Z [2020-06-25T23:11:59,203][WARN ][logstash.inputs.jdbc ][sector_symbol_one][c564e52a52c7396ffa0533a145f2e4d228e32c50e285fcef8a7f2c04cc23171c] Exception when executing JDBC query {:exception=>org.postgresql.util.PSQLException: The column index is out of range: 1, number of columns: 0.}

I tried a few other changes to the config file, but I was not successful.

I am either getting the Postgres column index error, or Logstash is running out of heap (OOM).
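For the heap issue, I am also thinking of enabling paging on the jdbc input so the result set is fetched in chunks (a sketch; the page size is just a guess):

    jdbc_paging_enabled => true   # fetch results page by page instead of all at once
    jdbc_page_size => 50000       # rows per page; tune for your data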

Please let me know how to run a SQL statement with a multi-table join and multiple filter predicates, and still perform delta pulls.

For anyone that might stumble upon this issue and come here searching for answers...

I was able to get it working. Surprisingly (after 2 days), it turned out to be pretty simple.

Here is the Logstash config snippet that made the difference...

  1. Set the timezone to UTC.
  2. Pay close attention to the SQL statement.
  3. For the first run, to get all the data, set clean_run => true. Once the first run is complete, remove this setting.
  4. Ordering by the date column is VERY IMPORTANT.

jdbc_default_timezone => "UTC"
statement => "select a.trade_date, a.low_52week
              from stock_daily_data a, company_listing b
              where a.trade_date > :sql_last_value and a.symbol = b.symbol
              order by trade_date"
use_column_value => true
tracking_column => "trade_date"
tracking_column_type => "timestamp"
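For completeness, here is roughly how my whole jdbc input block looks now; the connection settings and schedule below are placeholders, not my real values. My understanding of the earlier error is that the '?' was inside single quotes, so the driver treated it as a string literal and found zero bind parameters to set; using :sql_last_value directly avoids prepared-statement binding altogether.

input {
  jdbc {
    # Placeholder connection settings -- substitute your own driver path, URL, and credentials.
    jdbc_driver_library => "/path/to/postgresql.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_password => "mypassword"
    jdbc_default_timezone => "UTC"

    # clean_run => true       # only for the very first full pull; remove after that run
    schedule => "0 1 * * *"   # placeholder: run once a day at 01:00

    statement => "select a.trade_date, a.low_52week
                  from stock_daily_data a, company_listing b
                  where a.trade_date > :sql_last_value and a.symbol = b.symbol
                  order by trade_date"

    use_column_value => true
    tracking_column => "trade_date"
    tracking_column_type => "timestamp"
    last_run_metadata_path => "/dockermnt/logstash/lastvalues/xx.yml"
  }
}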
