We are getting a Java OutOfMemoryError during the initial load of a huge dataset of ~70 million records from Oracle into Elasticsearch using Logstash and the jdbc input. The problem seems to start when using use_prepared_statements => true.
The Java heap size is 4 GB, but even with 8 or 16 GB the situation is the same.
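For reference, the heap is set in Logstash's jvm.options file (path assumed to be the default install layout):

```
## config/jvm.options: initial and maximum heap, set to the same value
-Xms4g
-Xmx4g
```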
We have successfully loaded the whole view once with use_prepared_statements => false, but to make delta loads more efficient as well, we would like to use prepared statements.
Our idea is to use jdbc_fetch_size => 1000 to limit the amount of data transferred in one fetch.
According to the documentation, jdbc_page_size has no effect when prepared statements are used. Paging would not be efficient for us anyway: the view behind the query is complex, and running it repeatedly with a different offset each time during the initial load would be expensive.
Within Oracle we can see that the query itself had finished, and roughly 1 million of the ~70 million records had been fetched, when Logstash failed with the OutOfMemoryError.
Our current Logstash config:
input {
  jdbc {
    type => "sim_list"
    jdbc_validate_connection => true
    jdbc_driver_library => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/ojdbc6.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@db-server:1521:TA2A"
    jdbc_user => "test"
    jdbc_password => "testpwd"
    tracking_column => "last_import"
    use_column_value => true
    tracking_column_type => "timestamp"
    prepared_statement_bind_values => [":sql_last_value"]
    prepared_statement_name => "get_sim_list_tua2_v5"
    use_prepared_statements => true
    last_run_metadata_path => "/pkg/moip/otc/data/ls-otc-tmsp_tua2/jdbc/sql_last_run_sim_list_tua2_v5"
    statement => "SELECT simlist.*, simlist.LAST_UPDATE as last_import
                  FROM V_SIM_LIST_VIEW simlist
                  WHERE last_update between (?) - 1/86400 and sysdate
                  ORDER BY last_update ASC"
    # run every 5 seconds
    schedule => "*/5 * * * * *"
    connection_retry_attempts => "3"
    jdbc_fetch_size => 1000
  }
}
filter {
  mutate {
    # copy takes a source field name as the key, not a %{...} sprintf reference
    copy => { "di_id" => "[@metadata][_id]" }
    split => { "user_ids" => "," }
    split => { "option_ids" => "|" }
    split => { "apn_ids" => "|" }
    split => { "ip_address" => "|" }
  }
}
output {
  if [type] == "sim_list" {
    elasticsearch {
      hosts => ["https://ece.mydomain.com:443"]
      index => "sim_list_tua2_v5"
      document_id => "%{di_id}"
      #doc_as_upsert => true
      manage_template => false
      cacert => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/certs/ca.crt"
      ssl => true
      ssl_certificate_verification => true
      user => "logstash_write"
      password => "${LSWRITEPWD}"
    }
  }
}
The idea is to use one big query for the initial load, streamed in chunks limited by jdbc_fetch_size.
Subsequent scheduled runs shall use :sql_last_value to pick up incremental changes.
However, it breaks soon after the first couple of fetches.
Currently it seems that jdbc_fetch_size doesn't work as expected, so we have no clue why it keeps failing.
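One fallback we could live with is splitting the job in two pipelines: a one-off initial load with prepared statements disabled (which already worked for us once), and the scheduled delta pipeline with prepared statements. A sketch of the initial-load part, with the remaining settings assumed to be the same as in the full config above:

```
# one-off initial load: no prepared statement, streaming fetch
jdbc {
  use_prepared_statements => false
  jdbc_fetch_size => 1000
  statement => "SELECT simlist.*, simlist.LAST_UPDATE as last_import
                FROM V_SIM_LIST_VIEW simlist
                ORDER BY last_update ASC"
  # no schedule option: run once, e.g. bin/logstash -f initial_load.conf
  # connection, tracking and output settings as in the full config above
}
```

After that run, the delta pipeline would start from the last_run_metadata_path written during the initial load, but we would still prefer to understand why the fetch size is not honored with prepared statements.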