Error with paging large result set using SQL Server

Hello everyone!
I'm trying to import a data set with more than 4 million rows from my database in SQL Server.

I followed the advice in the Jdbc input plugin page of the Logstash Reference [8.4] and set jdbc_fetch_size to a small value of 1000, but nothing changed and the process exits with a timeout.

Then I tried setting jdbc_paging_enabled => true and jdbc_paging_mode => "explicit" in order to manage limit/offset manually, but after running the pipeline I get this error:

[2022-09-01T20:00:30,453][ERROR][logstash.inputs.jdbc     ] Unknown setting 'jdbc_paging_mode' for jdbc
[2022-09-01T20:00:30,460][ERROR][logstash.agent           ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:my-pipeline, :exception=>"Java::JavaLang::IllegalStateException", :message=>"Unable to configure plugins: (ConfigurationError) Something is wrong with your configuration.", :backtrace=>["org.logstash.config.ir.CompiledPipeline.<init>
(CompiledPipeline.java:119)", "org.logstash.execution.JavaBasePipelineExt.initialize(JavaBasePipelineExt.java:83)", "org.logstash.execution.JavaBasePipelineExt$INVOKER$i$1$0$initialize.call(JavaBasePipelineExt$INVOKER$i$1$0$initialize.gen)", "org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:837)", "org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuper(IRRuntimeHelpers.java:1169)", "org.jruby.ir.instructions.InstanceSuperInstr.interpret(InstanceSuperInstr.java:84)", "org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:361)", "org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:86)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:73)", "org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:207)", "usr.share.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0(/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:52)", "usr.share.logstash.logstash_minus_core.lib.logstash.pipeline_action.create.RUBY$method$execute$0$__VARARGS__(/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb)", "org.jruby.internal.runtime.methods.CompiledIRMethod.call(CompiledIRMethod.java:80)", "org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:70)", "org.jruby.ir.targets.InvokeSite.invoke(InvokeSite.java:207)", "usr.share.logstash.logstash_minus_core.lib.logstash.agent.RUBY$block$converge_state$2(/usr/share/logstash/logstash-core/lib/logstash/agent.rb:389)", "org.jruby.runtime.CompiledIRBlockBody.callDirect(CompiledIRBlockBody.java:138)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:58)", "org.jruby.runtime.IRBlockBody.call(IRBlockBody.java:52)", "org.jruby.runtime.Block.call(Block.java:139)", "org.jruby.RubyProc.call(RubyProc.java:318)", 
"org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:105)", "java.base/java.lang.Thread.run(Thread.java:834)"]}

warning: thread "Converge PipelineAction::Create<my-pipeline>" terminated with exception (report_on_exception is true):

LogStash::Error: Don't know how to handle `Java::JavaLang::IllegalStateException` for `PipelineAction::Create<my-pipeline>`
  create at org/logstash/execution/ConvergeResultExt.java:129
  add at org/logstash/execution/ConvergeResultExt.java:57
  converge_state at /usr/share/logstash/logstash-core/lib/logstash/agent.rb:402

It seems jdbc_paging_mode does not exist? Or is it only valid for drivers other than SQL Server?

So, how can I paginate results efficiently with SQL Server? I can't find any example in the documentation or in the GitHub plugin repo either.

Here's a copy of my conf:

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://${DB_SERVER};databaseName=${DB_NAME}"
    jdbc_user => "${DB_USER}"
    jdbc_password => "${DB_PASSWORD}"
    jdbc_driver_library => "./logstash-core/lib/jars/mssql-jdbc-11.2.0.jre11.jar" # Driver downloaded from https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver15
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_paging_enabled => true
    jdbc_paging_mode => "explicit"
    jdbc_page_size => 1000
    jdbc_fetch_size => 1000
    lowercase_column_names => false
    record_last_run => true
    schedule => "*/10 * * * * *"
    statement => "SELECT * FROM MyTable WHERE CreatedDate > :sql_last_value AND CreatedDate < GETUTCDATE() ORDER BY CreatedDate OFFSET :offset ROWS FETCH NEXT :size ROWS ONLY"
    use_column_value => true
    tracking_column => "CreatedDate"
    tracking_column_type => "timestamp"
  }
}

filter {

}

output {
  stdout { codec => rubydebug { metadata => true } }
}

Any help would be appreciated!

Regards,
Manuel

The jdbc input plugin is built on top of the logstash-integration-jdbc plugin. The jdbc_paging_mode option was added to that in version 5.2.0. Perhaps you have version 5.1.10 of the integration plugin installed.
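For reference, you can check which version of the integration plugin is installed and update it with the logstash-plugin tool (paths assume a default installation under /usr/share/logstash; adjust for your setup):

```
# Show the installed version of the integration plugin
bin/logstash-plugin list --verbose logstash-integration-jdbc

# Update it to the latest published version
bin/logstash-plugin update logstash-integration-jdbc
```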

Hey @Badger thanks for the reply!

Oh, I see... it looks like the server is using a very outdated version. I'm going to update to the latest version on my localhost and see what happens.

Thanks!

After updating to the latest version I could start moving data to Elastic, but after approximately 30 minutes my connection to the DB hangs with this message:

[2022-09-05T14:31:05,342][ERROR][logstash.inputs.jdbc     ][my-pipeline][76b2ed409d2761db0fcdbf6723e34c55ea9a23122a5b5bc3d151369078e46b2e] Java::ComMicrosoftSqlserverJdbc::SQLServerException: SQL Server did not return a response. The connection has been closed.: SELECT *
FROM MyTable
WHERE CreatedDate > '1970-01-01T00:00:00.000' AND CreatedDate < GETUTCDATE()
ORDER BY CreatedDate
OFFSET 365000 ROWS
FETCH NEXT 5000 ROWS ONLY

[2022-09-05T14:31:05,402][WARN ][logstash.inputs.jdbc     ][my-pipeline][76b2ed409d2761db0fcdbf6723e34c55ea9a23122a5b5bc3d151369078e46b2e] Exception when executing JDBC query {:exception=>Sequel::DatabaseError, :message=>"Java::ComMicrosoftSqlserverJdbc::SQLServerException: SQL Server did not return a response. The connection has been closed.", :cause=>"#<Java::ComMicrosoftSqlserverJdbc::SQLServerException: SQL Server did not return a response. The connection has been closed.>"}
[2022-09-05T14:31:09,645][INFO ][logstash.inputs.jdbc     ][my-pipeline]

I suspect SQL Server is closing the connection after this amount of time, so I tried modifying some properties, but without luck.

input {
  jdbc {
    connection_retry_attempts => 5
    connection_retry_attempts_wait_time => 5
    jdbc_connection_string => "jdbc:sqlserver://${DB_SERVER};databaseName=${DB_NAME}"
    jdbc_user => "${DB_USER}"
    jdbc_password => "${DB_PASSWORD}"
    jdbc_driver_library => "./logstash-core/lib/jars/mssql-jdbc-11.2.0.jre11.jar" # Driver downloaded from https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-ver15
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_pool_timeout => 60
    jdbc_paging_enabled => true
    jdbc_paging_mode => "explicit"
    jdbc_page_size => 5000
    jdbc_fetch_size => 1000
    jdbc_validate_connection => true
    jdbc_validation_timeout => 900
    sequel_opts => {
      max_connections => 6
      log_connection_info => true
    }
    lowercase_column_names => false
    record_last_run => true
    schedule => "*/10 * * * * *"
    statement => "SELECT * FROM MyTable WHERE CreatedDate > :sql_last_value AND CreatedDate < GETUTCDATE() ORDER BY CreatedDate OFFSET :offset ROWS FETCH NEXT :size ROWS ONLY"
    use_column_value => true
    tracking_column => "CreatedDate"
    tracking_column_type => "timestamp"
  }
}

filter {
  sleep {
    time => "1"
    every => 1000
  }
}

output {
  stdout { codec => rubydebug { metadata => true } }
}

Is there a way to reconnect or create a new connection to the DB after a certain amount of time? Or what would be the best approach to solve this?

Thanks in advance,

I would look for exceptions logged in the SQL Server logs.

I tried running the same pipeline with the query pointed at another table, and it worked without any interruption after importing 4 million rows.
Maybe this table has corrupted data or something else weird that causes the connection error, but I don't know, I'm just guessing.

Anyway, I need to find a workaround, so I was wondering if there is a way to limit the execution of the input stage, or if I can write the progress of every round to my last_run_metadata_path. In that case, after the connection is cut I could restart from the previous value instead of starting from the beginning, as it does now.

The input only writes the last_value to the last_run_metadata_path if the SQL query executes without an exception. It does not track partial executions.
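Given that, one common workaround is to make each scheduled run small enough to finish reliably: instead of paging over the whole backlog in one execution, bound each run to a fixed time window starting at :sql_last_value. Every successful run then advances the tracking value, so a dropped connection only costs you one window. A minimal sketch (the 30-minute window and the DATEADD call are illustrative assumptions; tune the window to your data volume, and note that paging is dropped here since each window is already bounded):

```
input {
  jdbc {
    # ... connection settings as above ...
    schedule => "*/10 * * * * *"
    # Each run fetches at most a 30-minute slice past the last committed value
    statement => "SELECT * FROM MyTable WHERE CreatedDate > :sql_last_value AND CreatedDate <= DATEADD(minute, 30, :sql_last_value) ORDER BY CreatedDate"
    use_column_value => true
    tracking_column => "CreatedDate"
    tracking_column_type => "timestamp"
    record_last_run => true
  }
}
```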

Ok @Badger , thanks