Tune logstash JDBC input for huge datasets

Marco_Lagalla · April 17, 2023, 7:47pm

Hi all!

I am trying to migrate a very huge dataset from a source oracle database to Elasticsearch.
The data that I am trying to migrate is contained in a single table, which is partitioned weekly, based on a timestamp field "CREATED".

This table holds ~30TB of data, containing log payloads, the length of each payload can vary from 5kb up to 70kb size.
I have tried to configure an ingestion pipeline, but it seems to be too much slow.

Logstash is currently running on a dedicated server with 8CPUs and 32GB RAM.

I have tried to play with some input settings like 'jdbc_fetch_size' or 'jdbc_page_size', but I didn't see any improovent.

I had the confirmation that the Elasticsearch cluster is not the bottleneck (same ingestion rate without the elasticsearch output plugin, using stdout as sink), and also network or storage are not limiting anything.

input {

  jdbc {

    jdbc_driver_library => "/usr/share/logstash/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@host:$port/MYDB"
    jdbc_user => "user"
    jdbc_password => "password"
    #schedule => "*/5 * * * *"
    #schedule => "*/30 * * * * *"
    schedule => "* * * * *"
    statement => "SELECT * from MYSCHEMA.MY_TABLE WHERE CREATED > :sql_last_value"
    use_column_value => true
    tracking_column => "CREATED"
    tracking_column_type => "timestamp"
    lowercase_column_names => false
    clean_run => false
    last_run_metadata_path => "/usr/share/logstash/queues/last_run"
    jdbc_page_size => 100000
    jdbc_paging_enabled => true
  }

}
# The filter part of this file is commented out to indicate that it is
# optional.
filter {

  mutate {
    convert => [ "CREATED", "string" ]
    add_field => { "SCHEMA" => "MYSCHEMA" }
    add_field => { "TABLE" => "MY_TABLE" }
  }
  date {
    timezone => "UTC"
    match => ["CREATED", "yyyy-MM-dd HH:mm:ss.SSSSSS", "yyyy-MM-dd HH:mm:ss,SSSSSS", "ISO8601"]
  }
}

output {
    elasticsearch {
        hosts       => ["https://elasticsearch:9200"]
        user        => "user"
        password    => "pass"
        index       => "logs-from-db-%{+YYYY.MM.dd}"
    }
}

Also, my logstash.yml file looks like this;:

pipeline:
  workers: 8
  batch:
    size: 1250
    delay: 50

path:
   config: "/usr/share/logstash/pipeline/pipeline.conf"

With this setup I am able to ingest ~3 milion of records at day (~15GB), which is far way to low for my use case.

Do anybody have some hints on how to tune jdbc input plugin for huge datasets?

Thanks in advance

leandrojmp · April 17, 2023, 7:53pm

Have you tried to increase the batch.size ? Try to double it and see how it behaves.

I would keep doubling it until things stop improving.

Badger · April 18, 2023, 12:37am

One record per second for 8 pipelines is 690K records per day, so each of your pipelines is processing about 5 records per second, which is ridiculously slow.

If the average record is 10 KB and your jdbc_page_size is 100000, then the input is fetching 1 GB of data for each page. I wonder if you are running into GC issues. You could enable GC logging and start tuning GC, or you could try decreasing jdbc_page_size and see if throughput improves. Rinse and repeat.

Marco_Lagalla · April 18, 2023, 9:16am

Thanks for you answer.

I tried to double the batch.size parameter till the value of 10k, but without significant improvements.

I also tried to increase in various step the number of workers, since the CPU usage is close to 4% (across the 8 available cores), but again no changes in the behavior.

Marco_Lagalla · April 18, 2023, 9:18am

Thanks @Badger ,

I am not sure to have understood your hint.

You are suggesting that I can have some issue with Logstash garbage collection, right?
I don't know how to enable logging for GC (I am doing some researches), for sure I will start to enable DEBUG level for logstash logs.

I tried to decrease the page size to 50k, 30k, and 5k, but the overall throughput seems to remain stuck.

system · May 16, 2023, 9:18am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Logstash jdbc-input taking long time to load data from Oracle DataBase Logstash	22	4192	October 5, 2020
Logstash jdbc for mysql Logstash	7	2762	July 6, 2017
Logstash jdbc input plugin not indexing all data to elasticsearch Logstash	2	1499	February 5, 2018
How logstash is working? Logstash	3	2549	October 25, 2018
How many records does jdbc input plugin can read at once? Logstash jdbc	2	317	September 21, 2023

Tune logstash JDBC input for huge datasets

Related topics