How to get data from large JDBC queries?

Hello,

I am trying to get data from a Postgres database using JDBC. I managed to get data to stdout using only 3 fields, but it fails when the query has a large number of fields.

Here is the output from running the query with many fields:
WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[INFO ] 2020-11-08 02:07:06.227 [main] runner - Starting Logstash {"logstash.version"=>"7.9.3", "jruby.version"=>"jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server VM 25.272-b10 on 1.8.0_272-8u272-b10-0ubuntu1~20.04-b10 +indy +jit [linux-x86_64]"}
[WARN ] 2020-11-08 02:07:06.952 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2020-11-08 02:07:10.906 [Converge PipelineAction::Create] Reflections - Reflections took 87 ms to scan 1 urls, producing 22 keys and 45 values
[INFO ] 2020-11-08 02:07:12.070 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1000, "pipeline.sources"=>["/etc/logstash/conf.d/payments_transferFaliure.conf"], :thread=>"#<Thread:0x48330599 run>"}
[INFO ] 2020-11-08 02:07:14.041 [[main]-pipeline-manager] javapipeline - Pipeline Java execution initialization time {"seconds"=>1.96}
[INFO ] 2020-11-08 02:07:14.503 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2020-11-08 02:07:14.612 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2020-11-08 02:07:15.334 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/rufus-scheduler-3.0.9/lib/rufus/scheduler/cronline.rb:77: warning: constant ::Fixnum is deprecated

It will not stop; the prompt just keeps blinking for hours.

Here is the statement that works (only two fields):
statement => "SELECT id as id, created as created FROM payments_transfer"

And this is the query with many fields that is not working (same settings as the above query!):
statement => "SELECT id as id, type as type, dwolla_id as dwolla_id, sender_id as sender_id, receiver_id as receiver_id, status as status, amount as amount, fees as fees, created as created, metadata as metadata, funding_source_from_id as funding_source_from_id, funding_source_to_id as funding_source_to_id, is_parent as is_parent, destination_type as destination_type, source_type as source_type FROM payments_transfer"

How can I make this work with big queries?

Please make sure you format your code and errors as follows, and share your config: /etc/logstash/conf.d/payments_transferFaliure.conf

SELECT id                     AS id, 
       type                   AS type, 
       dwolla_id              AS dwolla_id, 
       sender_id              AS sender_id, 
       receiver_id            AS receiver_id, 
       status                 AS status, 
       amount                 AS amount, 
       fees                   AS fees, 
       created                AS created, 
       metadata               AS metadata, 
       funding_source_from_id AS funding_source_from_id, 
       funding_source_to_id   AS funding_source_to_id, 
       is_parent              AS is_parent, 
       destination_type       AS destination_type, 
       source_type            AS source_type 
FROM   payments_transfer 
Here is my config:

input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://ldatabase_rds_path?useSSL=true"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/z/Documents/postgresql-42.2.18.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        tracking_column => "id"
        tracking_column_type => "numeric"
        clean_run => true
        schedule => "0 */1 * * *"
        statement => "SELECT id as id, type as type, z_id as z_id, sender_id as sender_id, receiver_id as receiver_id, status as status, amount as amount, fees as fees, created as created, metadata as metadata, funding_source_from_id as funding_source_from_id, funding_source_to_id as funding_source_to_id, is_parent as is_parent, destination_type as destination_type, source_type as source_type FROM payments_transfer"
    }
}
output {
    stdout { codec => json_lines }
    elasticsearch {
        hosts => ["localhost:9200"]
        manage_template => false
        index => "payments-transfer-query"
        document_id => "%{id}"
    }
}

In the input, try this:

input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://ldatabase_rds_path?useSSL=true"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/z/Documents/postgresql-42.2.18.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        use_column_value => true       # track the id value itself, so :sql_last_value is an id
        tracking_column => "id"
        tracking_column_type => "numeric"
        clean_run => true
        schedule => "0 */1 * * *"
        jdbc_paging_enabled => true    # required for jdbc_page_size to take effect
        jdbc_page_size => 50000
        last_run_metadata_path => "/path/to/last_run_value.txt"
        statement => "SELECT id as id, type as type, z_id as z_id, sender_id as sender_id, receiver_id as receiver_id, status as status, amount as amount, fees as fees, created as created, metadata as metadata, funding_source_from_id as funding_source_from_id, funding_source_to_id as funding_source_to_id, is_parent as is_parent, destination_type as destination_type, source_type as source_type FROM payments_transfer WHERE id > :sql_last_value"
    }
}

jdbc_page_size limits how many rows are fetched on each scheduled run; you can adjust it to match the capacity of your machine (note that it only takes effect together with jdbc_paging_enabled).
last_run_metadata_path is the file that stores the last value of the id after each query.
WHERE id > :sql_last_value reads the value back from that metadata file; on the first run it will be 0, and the result set is limited by the pagination.
I suggest limiting the query result because, when running on a schedule, the previous query may not have finished yet, so Logstash will re-query the same values and make your machine slower.
I also suggest running Logstash more frequently and fetching a smaller amount of data each time, because it will hurt your machine to pull a large result set in one go.
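
To illustrate the pagination, the plugin breaks each run into page-sized queries. This is only a sketch of the idea; the actual SQL is generated by the Sequel library the plugin uses, so the exact form may differ:

-- Roughly what one page looks like with jdbc_page_size => 50000
-- (illustrative only; the real statements are generated by Sequel)
SELECT * FROM (
    SELECT * FROM payments_transfer WHERE id > 0
) AS t1
LIMIT 50000 OFFSET 0;
-- next page: LIMIT 50000 OFFSET 50000, and so on until the result set is exhausted

Because events are emitted page by page, output should appear in batches instead of only after the whole table has been read.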

I tried your solution, adding jdbc_page_size and last_run_metadata_path, and I also changed the schedule to 0 */12 * * * # Every twelve hours at minute 0 of the hour. It is still not working; it just runs for hours without any result:

Here is my console, which stays like this for hours:

WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
[INFO ] 2020-11-11 16:28:22.850 [main] runner - Starting Logstash {"logstash.version"=>"7.9.3", "jruby.version"=>"jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server VM 25.272-b10 on 1.8.0_272-8u272-b10-0ubuntu1~20.04-b10 +indy +jit [linux-x86_64]"}
[WARN ] 2020-11-11 16:28:24.130 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
[INFO ] 2020-11-11 16:28:28.736 [Converge PipelineAction::Create<main>] Reflections - Reflections took 99 ms to scan 1 urls, producing 22 keys and 45 values 
[INFO ] 2020-11-11 16:28:31.128 [[main]-pipeline-manager] elasticsearch - Elasticsearch pool URLs updated {:changes=>{:removed=>[], :added=>[http://localhost:9200/]}}
[WARN ] 2020-11-11 16:28:31.591 [[main]-pipeline-manager] elasticsearch - Restored connection to ES instance {:url=>"http://localhost:9200/"}
[INFO ] 2020-11-11 16:28:31.980 [[main]-pipeline-manager] elasticsearch - ES Output version determined {:es_version=>7}
[WARN ] 2020-11-11 16:28:31.985 [[main]-pipeline-manager] elasticsearch - Detected a 6.x and above cluster: the `type` event field won't be used to determine the document _type {:es_version=>7}
[INFO ] 2020-11-11 16:28:32.195 [[main]-pipeline-manager] elasticsearch - New Elasticsearch output {:class=>"LogStash::Outputs::ElasticSearch", :hosts=>["//localhost:9200"]}
[INFO ] 2020-11-11 16:28:32.576 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1000, "pipeline.sources"=>["/etc/logstash/conf.d/payments_transferFaliure_with_large_capacity.conf"], :thread=>"#<Thread:0x4fcc21a9 run>"}
[INFO ] 2020-11-11 16:28:34.619 [[main]-pipeline-manager] javapipeline - Pipeline Java execution initialization time {"seconds"=>2.04}
[INFO ] 2020-11-11 16:28:34.909 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
[INFO ] 2020-11-11 16:28:35.119 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[INFO ] 2020-11-11 16:28:35.882 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9600}
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/rufus-scheduler-3.0.9/lib/rufus/scheduler/cronline.rb:77: warning: constant ::Fixnum is deprecated

Did you run Logstash from systemd (systemctl) or from the command line?
If using systemctl, what is your pipelines.yml config?

I run it from the command line: sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/payments_transferFaliure_with_large_capacity.conf --config.reload.automatic

If I run more than one shell, I specify a separate directory for the data.

Here is my pipelines.yml config:

- pipeline.id: main
  path.config: "/etc/logstash/conf.d/*.conf"

Please try this from the command line:

sudo /usr/share/logstash/bin/logstash -f "/etc/logstash/conf.d/*.conf" --config.reload.automatic --path.settings /etc/logstash

This makes sure that logstash.yml and pipelines.yml are loaded properly.

I tried with a command line that specifies both the data and settings paths: --path.data /usr/share/logstash/data/[directory] --path.settings /etc/logstash, and it is still not working.
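
For reference, the full command looked like this (the bracketed directory is a placeholder):

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/payments_transferFaliure_with_large_capacity.conf --config.reload.automatic --path.data /usr/share/logstash/data/[directory] --path.settings /etc/logstash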

Let's get the debug output in here with this:

input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://ldatabase_rds_path?useSSL=true"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/z/Documents/postgresql-42.2.18.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        use_column_value => true       # track the id value itself, so :sql_last_value is an id
        tracking_column => "id"
        tracking_column_type => "numeric"
        clean_run => true
        schedule => "* * * * *"        # every minute, for debugging
        jdbc_paging_enabled => true    # required for jdbc_page_size to take effect
        jdbc_page_size => 50000
        last_run_metadata_path => "/path/to/last_run_value.txt"
        statement => "SELECT id as id, type as type, z_id as z_id, sender_id as sender_id, receiver_id as receiver_id, status as status, amount as amount, fees as fees, created as created, metadata as metadata, funding_source_from_id as funding_source_from_id, funding_source_to_id as funding_source_to_id, is_parent as is_parent, destination_type as destination_type, source_type as source_type FROM payments_transfer WHERE id > :sql_last_value"
    }
}

filter {
}

output {
    stdout { codec => rubydebug }
}

Then paste the error messages from stdout in here, and we can analyze the details.

Thanks. I added the debug output, but the cursor still just keeps blinking for hours without any output:

> Using bundled JDK: /usr/share/logstash/jdk
> OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.jruby.ext.openssl.SecurityHelper (file:/tmp/jruby-8752/jruby3327211400598248584jopenssl.jar) to field java.security.MessageDigest.provider
> WARNING: Please consider reporting this to the maintainers of org.jruby.ext.openssl.SecurityHelper
> WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> WARNING: Could not find logstash.yml which is typically located in $LS_HOME/config or /etc/logstash. You can specify the path using --path.settings. Continuing using the defaults
> Could not find log4j2 configuration at path /usr/share/logstash/config/log4j2.properties. Using default config which logs errors to the console
> [INFO ] 2020-11-21 23:28:24.783 [main] runner - Starting Logstash {"logstash.version"=>"7.10.0", "jruby.version"=>"jruby 9.2.13.0 (2.5.7) 2020-08-03 9a89c94bcc OpenJDK 64-Bit Server VM 11.0.8+10 on 11.0.8+10 +indy +jit [linux-x86_64]"}
> [WARN ] 2020-11-21 23:28:25.113 [LogStash::Runner] multilocal - Ignoring the 'pipelines.yml' file because modules or command line options are specified
> [INFO ] 2020-11-21 23:28:26.966 [Converge PipelineAction::Create<main>] Reflections - Reflections took 43 ms to scan 1 urls, producing 23 keys and 47 values 
> [INFO ] 2020-11-21 23:28:27.535 [[main]-pipeline-manager] javapipeline - Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>1000, "pipeline.sources"=>["/etc/logstash/conf.d/payment_transfer.conf"], :thread=>"#<Thread:0x23ef9f58 run>"}
> [INFO ] 2020-11-21 23:28:28.239 [[main]-pipeline-manager] javapipeline - Pipeline Java execution initialization time {"seconds"=>0.7}
> [INFO ] 2020-11-21 23:28:28.388 [[main]-pipeline-manager] javapipeline - Pipeline started {"pipeline.id"=>"main"}
> [INFO ] 2020-11-21 23:28:28.449 [Agent thread] agent - Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
> [INFO ] 2020-11-21 23:28:28.693 [Api Webserver] agent - Successfully started Logstash API endpoint {:port=>9601}
> /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/rufus-scheduler-3.0.9/lib/rufus/scheduler/cronline.rb:77: warning: constant ::Fixnum is deprecated

I also found that it works in seconds after commenting out the schedule line, which means there is something wrong with the schedule, but I can't figure it out!

If you schedule an input to run at hours 0 and 12, I believe it will wait for one of those times before running for the first time. Similarly, if you configure it to run every 12 hours, it will wait 12 hours before the first run.

If you want it to unconditionally run at startup, then use a second input configured without a schedule option, as in the sketch below. But note that if you start Logstash at 11:50, the query might run twice in ten minutes. That may or may not be a problem for your use case.
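
For illustration, a minimal sketch of that two-input approach, reusing the connection settings already shared in this thread. The SELECT * statement is simplified, and whether both inputs can safely share one last_run_metadata_path can depend on the plugin version, so treat the shared file as an assumption to verify:

input {
    # One-shot input: no schedule option, so it runs once when Logstash starts.
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://ldatabase_rds_path?useSSL=true"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/z/Documents/postgresql-42.2.18.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        use_column_value => true
        tracking_column => "id"
        tracking_column_type => "numeric"
        # Assumed shared metadata file so both inputs advance the same cursor
        last_run_metadata_path => "/path/to/last_run_value.txt"
        statement => "SELECT * FROM payments_transfer WHERE id > :sql_last_value"
    }
    # Scheduled input: same query, runs at minute 0 every 12 hours.
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://ldatabase_rds_path?useSSL=true"
        jdbc_user => "username"
        jdbc_password => "password"
        jdbc_driver_library => "/home/z/Documents/postgresql-42.2.18.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        schedule => "0 */12 * * *"
        use_column_value => true
        tracking_column => "id"
        tracking_column_type => "numeric"
        last_run_metadata_path => "/path/to/last_run_value.txt"
        statement => "SELECT * FROM payments_transfer WHERE id > :sql_last_value"
    }
}

This way the one-shot input catches up on existing rows at startup, and the scheduled input only picks up rows inserted since the last tracked id.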

Thanks, it works.
