Java OutOfMemory when initially synchronizing a huge result set from Oracle to Elasticsearch via the Logstash JDBC input

We are getting a Java OutOfMemoryError during the initial load of a huge dataset (~70 million records) from Oracle into Elasticsearch with the Logstash JDBC input. The problem seems to start when we set use_prepared_statements => true.
The Java heap size is 4 GB, but even with 8 or 16 GB the situation is the same.
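For reference, the heap is configured in Logstash's jvm.options (the path assumes a standard tarball layout); we only vary the two size lines between runs, and we are considering adding a heap dump on failure so the retained objects can be inspected:

# config/jvm.options
-Xms4g
-Xmx4g
# optional: write a dump when the OOM hits, for offline analysis
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/logstash-heapdump.hprof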

We have successfully loaded the whole view once with use_prepared_statements => false, but to make the delta loads more efficient as well, we would like to use prepared statements.
Our idea is to use jdbc_fetch_size => 1000 to limit the amount of data transferred per fetch.
According to the documentation, jdbc_page_size has no effect when prepared statements are used. Paging would not be efficient for us anyway: the view behind the query is complex, and re-running it with a different offset for every page of the initial load would be expensive.
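For context, this is roughly what the paged variant would look like; it is not our actual config, and the jdbc_page_size value is only illustrative. Logstash wraps the statement in paged subqueries, so every page re-executes the expensive view with a different offset:

input {
    jdbc {
        jdbc_driver_library => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/ojdbc6.jar"
        jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
        jdbc_connection_string => "jdbc:oracle:thin:@db-server:1521:TA2A"
        jdbc_user => "test"
        jdbc_password => "testpwd"
        use_prepared_statements => false
        jdbc_paging_enabled => true   # the statement gets wrapped in a paged subquery
        jdbc_page_size => 100000      # example value; each page hits the complex view again
        statement => "SELECT simlist.*, simlist.LAST_UPDATE as last_import FROM V_SIM_LIST_VIEW simlist"
    }
}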

On the Oracle side, we can see that the query itself had finished and roughly 1 million of the ~70 million records had been fetched when Logstash failed with the OutOfMemoryError.
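This is roughly how we watch the progress on the Oracle side (a sketch; it needs SELECT access to v$sql, and the sql_text filter is just an example matching our statement):

-- cumulative fetch calls and rows returned so far for our statement
SELECT sql_id, fetches, rows_processed
FROM   v$sql
WHERE  sql_text LIKE 'SELECT simlist.%'
ORDER  BY last_active_time DESC;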

Our current logstash config:

input {
    jdbc {
        type => "sim_list"
        jdbc_validate_connection => true
        jdbc_driver_library => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/ojdbc6.jar"
        jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
        jdbc_connection_string => "jdbc:oracle:thin:@db-server:1521:TA2A"
        jdbc_user => "test"
        jdbc_password => "testpwd"
        tracking_column => "last_import"
        use_column_value => true
        tracking_column_type => "timestamp"
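        # the single (?) placeholder in the statement below is bound to :sql_last_value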
        prepared_statement_bind_values => [":sql_last_value"]
        prepared_statement_name => "get_sim_list_tua2_v5"
        use_prepared_statements => true
        last_run_metadata_path => "/pkg/moip/otc/data/ls-otc-tmsp_tua2/jdbc/sql_last_run_sim_list_tua2_v5"
        statement => "SELECT simlist.*, simlist.LAST_UPDATE AS last_import
                      FROM V_SIM_LIST_VIEW simlist
                      WHERE last_update BETWEEN (?) - 1/86400 AND sysdate
                      ORDER BY last_update ASC"
        # run every 5 seconds
        schedule => "*/5 * * * * *"
        connection_retry_attempts => "3"
        jdbc_fetch_size => 1000
    }
}

filter {
    mutate {
        # reference the field by name; sprintf syntax like "%{di_id}" is not resolved here
        copy => {
            "di_id" => "[@metadata][_id]"
        }
        # one split hash covering all multi-valued columns
        split => {
            "user_ids"   => ","
            "option_ids" => "|"
            "apn_ids"    => "|"
            "ip_address" => "|"
        }
    }
}

output {
    if [type] == "sim_list" {
        elasticsearch {
            hosts => ["https://ece.mydomain.com:443"]
            index => "sim_list_tua2_v5"
            document_id => "%{di_id}"
            #doc_as_upsert => true
            manage_template => false
            cacert => "/pkg/moip/otc/apps/ls-otc-tmsp_tua2/conf/certs/ca.crt"
            ssl => true
            ssl_certificate_verification => true
            user => "logstash_write"
            password => "${LSWRITEPWD}"
        }
    }
}

The idea is to run one big query for the initial load and let jdbc_fetch_size stream it in chunks of 1000 rows, while the subsequent scheduled runs use :sql_last_value to pick up incremental changes.
However, the pipeline breaks soon after the first couple of fetches.
At the moment it looks as if jdbc_fetch_size does not take effect as expected, so we have no clue why it keeps failing.
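One thing we still want to try is lowering the number of in-flight events, since each worker batch is held on the heap in addition to whatever the driver buffers per fetch. A sketch of logstash.yml with example values:

# logstash.yml
pipeline.workers: 2        # fewer workers, fewer concurrent batches on the heap
pipeline.batch.size: 50    # below the default of 125 events per batch
pipeline.batch.delay: 50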
