Tune logstash JDBC input for huge datasets

Hi all!

I am trying to migrate a very huge dataset from a source oracle database to Elasticsearch.
The data that I am trying to migrate is contained in a single table, which is partitioned weekly, based on a timestamp field "CREATED".

This table holds ~30TB of data, containing log payloads, the length of each payload can vary from 5kb up to 70kb size.
I have tried to configure an ingestion pipeline, but it seems to be too much slow.

Logstash is currently running on a dedicated server with 8CPUs and 32GB RAM.

I have tried to play with some input settings like 'jdbc_fetch_size' or 'jdbc_page_size', but I didn't see any improovent.

I had the confirmation that the Elasticsearch cluster is not the bottleneck (same ingestion rate without the elasticsearch output plugin, using stdout as sink), and also network or storage are not limiting anything.

input {

  jdbc {

    jdbc_driver_library => "/usr/share/logstash/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@host:$port/MYDB"
    jdbc_user => "user"
    jdbc_password => "password"
    #schedule => "*/5 * * * *"
    #schedule => "*/30 * * * * *"
    schedule => "* * * * *"
    statement => "SELECT * from MYSCHEMA.MY_TABLE WHERE CREATED > :sql_last_value"
    use_column_value => true
    tracking_column => "CREATED"
    tracking_column_type => "timestamp"
    lowercase_column_names => false
    clean_run => false
    last_run_metadata_path => "/usr/share/logstash/queues/last_run"
    jdbc_page_size => 100000
    jdbc_paging_enabled => true

# The filter part of this file is commented out to indicate that it is
# optional.
filter {

  mutate {
    convert => [ "CREATED", "string" ]
    add_field => { "SCHEMA" => "MYSCHEMA" }
    add_field => { "TABLE" => "MY_TABLE" }
  date {
    timezone => "UTC"
    match => ["CREATED", "yyyy-MM-dd HH:mm:ss.SSSSSS", "yyyy-MM-dd HH:mm:ss,SSSSSS", "ISO8601"]

output {
    elasticsearch {
        hosts       => ["https://elasticsearch:9200"]
        user        => "user"
        password    => "pass"
        index       => "logs-from-db-%{+YYYY.MM.dd}"

Also, my logstash.yml file looks like this;:

  workers: 8
    size: 1250
    delay: 50

   config: "/usr/share/logstash/pipeline/pipeline.conf"

With this setup I am able to ingest ~3 milion of records at day (~15GB), which is far way to low for my use case.

Do anybody have some hints on how to tune jdbc input plugin for huge datasets?

Thanks in advance

Have you tried to increase the batch.size ? Try to double it and see how it behaves.

I would keep doubling it until things stop improving.

One record per second for 8 pipelines is 690K records per day, so each of your pipelines is processing about 5 records per second, which is ridiculously slow.

If the average record is 10 KB and your jdbc_page_size is 100000, then the input is fetching 1 GB of data for each page. I wonder if you are running into GC issues. You could enable GC logging and start tuning GC, or you could try decreasing jdbc_page_size and see if throughput improves. Rinse and repeat.

Thanks for you answer.

I tried to double the batch.size parameter till the value of 10k, but without significant improvements.

I also tried to increase in various step the number of workers, since the CPU usage is close to 4% (across the 8 available cores), but again no changes in the behavior.

Thanks @Badger ,

I am not sure to have understood your hint.

You are suggesting that I can have some issue with Logstash garbage collection, right?
I don't know how to enable logging for GC (I am doing some researches), for sure I will start to enable DEBUG level for logstash logs.

I tried to decrease the page size to 50k, 30k, and 5k, but the overall throughput seems to remain stuck.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.