Migrating 3 million records from an RDBMS to Elasticsearch using Logstash

Hi All,

We are trying to migrate around 3 million records from Oracle to Elasticsearch using Logstash.
We are applying a couple of jdbc_streaming filters in our Logstash pipeline: one to load related nested objects, and another to run a hierarchical query that populates a second nested object in the index.

We are able to index about 0.4 million records in 24 hours; those 0.4 million records occupy around 300 MB.
We have tried several approaches to move the data from Oracle into Elasticsearch more quickly, but have not been able to achieve the desired throughput.
Below are the approaches we have tried:
1. In the Logstash pipeline, we used the jdbc_fetch_size, jdbc_page_size, jdbc_paging_enabled, and clean_run parameters on the jdbc input, set pipeline.workers to 20, and set pipeline.batch.size to 125 in the logstash.yml file (see the sketch after this list).
2. On the Elasticsearch side, we set the number of replicas to 0 and the refresh interval to -1, tried increasing the indices.memory.index_buffer_size parameter, and increased the number of watcher queues in the elasticsearch.yml file.
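
For reference, this is roughly what those settings look like; the page and fetch sizes below are example values rather than the exact numbers we used.

In logstash.yml:

pipeline.workers: 20
pipeline.batch.size: 125

On the jdbc input (example values):

jdbc_paging_enabled => true
jdbc_page_size => 50000
jdbc_fetch_size => 1000
clean_run => true

On the index, applied via the REST API before the load:

PUT /<index_name>/_settings
{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}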

We have basically googled around and followed various suggestions from this site and others, but nothing has worked so far.

We are using a single-node Elasticsearch setup, and neither the database nor the Elasticsearch node is on the machine from which we run the Logstash script.

Please find our Logstash config file below:

input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}

filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    #statement => "select claimnumber,claimtype,is_active from claim where policynumber = :policynumber"
    parameters => { "policynumber" => "policynumber" }
    target => "nested node"
  }
}

filter {
  jdbc_streaming {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select listagg(column name,'/' ) within group(order by column name) from
                  where LEVEL > 1
                  start with =:
                  connect by prior = "
    parameters => { "p1" => "p1" }
    target => "nested node1"
  }
}

output {
  elasticsearch {
    hosts => [""]
    index => "<index_name>"
    document_id => "%{doc_id}"
  }
  stdout { codec => json }
}
Can you please help us identify the bottlenecks and suggest ways to increase indexing performance?

Thank You

Rather than using the two filters, what about making a view on the database that provides you with the end structure you need and then just using that in the input?
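
Something along these lines, for example; the table and column names here are made up for illustration, loosely based on the claim/policy query in your config:

create or replace view claim_summary_v as
select p.policynumber,
       listagg(c.claimnumber, '/') within group (order by c.claimnumber) as claimnumbers
from policy p
left join claim c on c.policynumber = p.policynumber
group by p.policynumber;

Then the jdbc input statement becomes a plain "select * from claim_summary_v" and that jdbc_streaming filter is no longer needed.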

Thanks, @warkolm.

We are representing a one-to-many relationship with one of the nested objects created using the filters, so creating a view would let us eliminate one of the filters.
We will try this and get back.

Meanwhile, if there are any more suggestions, please let us know.

@warkolm,

We tried loading the Oracle table data directly into Elasticsearch without any transformations, and the performance is the same: it still takes around a day to load around a million records.
Are there any other settings that could speed up the indexing process?

Not sure, sorry. Hopefully someone else will be able to assist.

You need to identify where the bottleneck is. Start off by just running your jdbc input with a sink output. Then add one filter, then add the other. Then add something like

output { stdout { codec => dots } }

Then try sending the data to Elasticsearch. If the throughput drops massively when you start sending data to Elasticsearch, the bottleneck may be in ES rather than Logstash.
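
For the first step, a stripped-down pipeline along these lines (reusing the placeholders from your config) will show you how fast the jdbc input alone can run. The dots codec prints one character per event, so the rate at which dots appear on the console is your event throughput:

input {
  jdbc {
    jdbc_driver_library => "LIB"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "connection url"
    jdbc_user => "user"
    jdbc_password => "pwd"
    statement => "select * from "
  }
}
output {
  stdout { codec => dots }
}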

No problem 🙂
Thanks for your input, @warkolm.

@Badger,
I tried running the script with the Elasticsearch output removed and the suggested stdout output in its place.
I see messages on the console indicating that data is being fetched from the DB in batches of 500 up to 5000, after which nothing more is printed to the console.
