Migrating 3 million records from an RDBMS to Elasticsearch using Logstash

Hi All,

We are trying to migrate around 3 million records from Oracle to Elasticsearch using Logstash.
We are applying a couple of jdbc_streaming filters in our Logstash pipeline: one to load related nested objects, and another to run a hierarchical query that populates a second nested object in the index.

We are able to index about 0.4 million records in 24 hours; those 0.4 million records occupy around 300 MB.
We have tried several approaches to move the data from Oracle into Elasticsearch more quickly, but have not been able to achieve the desired throughput.
Below are the approaches we have tried:
1. In the Logstash pipeline, we used the jdbc_fetch_size, jdbc_page_size, jdbc_paging_enabled, and clean_run parameters on the jdbc input, set pipeline.workers to 20, and set pipeline.batch.size to 125 in the logstash.yml file (see the sketch after this list).
2. On the Elasticsearch side, we set the number of replicas to 0 and the refresh interval to -1, tried increasing the indices.memory.index_buffer_size parameter, and increased the number of watcher queues in the elasticsearch.yml file.
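
For reference, this is roughly what those settings look like; the page and fetch sizes below are example values rather than the exact numbers we used.

In logstash.yml:

pipeline.workers: 20
pipeline.batch.size: 125

On the jdbc input (example values):

jdbc_paging_enabled => true
jdbc_page_size => 50000
jdbc_fetch_size => 1000
clean_run => true

On the index, applied via the REST API before the load:

PUT /<index_name>/_settings
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}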

We have basically googled around and followed various suggestions from this site and others, but nothing has worked so far.

We are using a single-node Elasticsearch setup, and neither the database nor the Elasticsearch node is on the machine from which we run the Logstash script.

Please find our Logstash config file below:

input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}

filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    #statement => "select claimnumber,claimtype,is_active from claim where policynumber = :policynumber"
    parameters => { "policynumber" => "policynumber" }
    target => "nested node"
  }
}

filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select listagg(column name,'/' ) within group(order by column name) from
                  where LEVEL > 1
                  start with =:
                  connect by prior = "
    parameters => { "p1" => "p1" }
    target => "nested node1"
  }
}

output {
  elasticsearch {
    hosts => [""]
    index => "<index_name>"
    document_id => "%{doc_id}"
  }
  stdout { codec => json }
}
Can you please help us identify the bottlenecks and suggest ways to increase indexing performance?

Thank You

Rather than using the two filters, what about making a view on the database that provides you with the end structure you need and then just using that in the input?
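
Something along these lines, for example; the table and column names here are made up for illustration, loosely based on the claim/policy query in your config:

create or replace view claim_summary_v as
select p.policynumber,
       listagg(c.claimnumber, '/') within group (order by c.claimnumber) as claimnumbers
from policy p
left join claim c on c.policynumber = p.policynumber
group by p.policynumber;

Then the jdbc input statement becomes a plain "select * from claim_summary_v" and that jdbc_streaming filter is no longer needed.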

Thanks, @warkolm.

We are representing a one-to-many relationship with one of the nested objects created using the filters, so creating a view would let us eliminate one of the filters.
We will try this and get back.

Meanwhile, if there are any more suggestions, please let us know.

@warkolm,

We tried loading the Oracle table data directly into Elasticsearch without any transformations, and the performance is the same: it still takes around a day to load around a million records.
Are there any other settings that could speed up the indexing process?

Not sure, sorry. Hopefully someone else will be able to assist.

You need to identify where the bottleneck is. Start off by just running your jdbc input with a sink output. Then add one filter, then add the other. Then add something like

output { stdout { codec => dots } }

Then try sending the data to Elasticsearch. If the throughput drops massively when you start sending data to Elasticsearch, the bottleneck may be in ES rather than Logstash.
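
For the first step, a stripped-down pipeline along these lines (reusing the placeholders from your config) will show you how fast the jdbc input alone can run. The dots codec prints one character per event, so the rate at which dots appear on the console is your event throughput:

input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}
output {
  stdout { codec => dots }
}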

No problem 🙂
Thanks for your input, @warkolm.

@Badger,
I tried running the script with the Elasticsearch output removed and the suggested stdout output in its place.
I see messages on the console indicating that data is being fetched from the DB in batches of 500 up to 5000, after which nothing more is printed to the console.
