Hi all!
I am trying to migrate a very large dataset from a source Oracle database to Elasticsearch.
The data is contained in a single table, which is partitioned weekly on a timestamp column "CREATED".
This table holds ~30 TB of data consisting of log payloads, where each payload varies from roughly 5 KB up to 70 KB in size.
I have tried to configure an ingestion pipeline, but it is far too slow.
Logstash is currently running on a dedicated server with 8 CPUs and 32 GB RAM.
I have tried to play with some input settings such as 'jdbc_fetch_size' and 'jdbc_page_size', but I didn't see any improvement.
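Concretely, those attempts looked more or less like this inside the jdbc input (the values here are only examples, not the exact ones I tested):

jdbc {
  # ... same connection settings as in the full config below ...
  jdbc_fetch_size => 10000      # rows fetched per round trip by the JDBC driver
  jdbc_paging_enabled => true
  jdbc_page_size => 100000      # rows per paged SELECT when paging is enabled
}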
I have confirmed that the Elasticsearch cluster is not the bottleneck (the ingestion rate is the same when the elasticsearch output plugin is replaced by a stdout sink), and neither the network nor the storage is a limiting factor.
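For that throughput test I used something along these lines in place of the elasticsearch output (just a sketch, the exact codec doesn't matter much):

output {
  # test sink: prints one dot per event instead of indexing
  stdout { codec => dots }
}

Here is the full pipeline configuration I am running: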
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@host:$port/MYDB"
    jdbc_user => "user"
    jdbc_password => "password"
    #schedule => "*/5 * * * *"
    #schedule => "*/30 * * * * *"
    schedule => "* * * * *"
    statement => "SELECT * FROM MYSCHEMA.MY_TABLE WHERE CREATED > :sql_last_value"
    use_column_value => true
    tracking_column => "CREATED"
    tracking_column_type => "timestamp"
    lowercase_column_names => false
    clean_run => false
    last_run_metadata_path => "/usr/share/logstash/queues/last_run"
    jdbc_page_size => 100000
    jdbc_paging_enabled => true
  }
}
filter {
  mutate {
    convert => [ "CREATED", "string" ]
    add_field => { "SCHEMA" => "MYSCHEMA" }
    add_field => { "TABLE" => "MY_TABLE" }
  }
  date {
    timezone => "UTC"
    match => ["CREATED", "yyyy-MM-dd HH:mm:ss.SSSSSS", "yyyy-MM-dd HH:mm:ss,SSSSSS", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["https://elasticsearch:9200"]
    user => "user"
    password => "pass"
    index => "logs-from-db-%{+YYYY.MM.dd}"
  }
}
Also, my logstash.yml file looks like this:
pipeline:
  workers: 8
  batch:
    size: 1250
    delay: 50
path:
  config: "/usr/share/logstash/pipeline/pipeline.conf"
With this setup I am able to ingest ~3 million records per day (~15 GB), which is far too low for my use case: at ~15 GB/day, moving the full ~30 TB would take on the order of 2,000 days.
Does anybody have hints on how to tune the jdbc input plugin for huge datasets?
Thanks in advance