Transferring a large table from PostgreSQL to Elasticsearch

Hi everybody,
I'm having an issue with a large table in PostgreSQL. The table has about 1 million rows, and each row contains some text roughly half an A4 page long. I want to index this table into Elasticsearch, but I always get java.lang.OutOfMemoryError: Java heap space. I increased the JVM heap size to 4 GB and can't increase it any further. I also added the jdbc_page_size option to my Logstash config file, but it doesn't help.

input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/jmdb"
        # The user we wish to execute our statement as
        jdbc_user => "xxx"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "${HOME}/postgresql-42.2.8.jar"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        jdbc_password => "xxx"
        jdbc_paging_enabled => true
        jdbc_page_size => 10000
        statement_filepath => "${INDEXING_DIRECTORY}/decision_index.sql"
        type => "decision"
    }
}
output {
    elasticsearch {
        index => "decision"
    }
}

Does anyone know how to solve this? Or is there a way to work out what jdbc_page_size I need so that I don't blow the Java heap?
Thank you a lot.

Try something like

select * from your_table where date_column between '2019-01-01' and '2019-01-31'

and go one month at a time.
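
For example, since your config already pulls ${HOME} and ${INDEXING_DIRECTORY} from the environment, you could pass the month bounds in the same way and rerun the pipeline once per window. This is only a sketch: START_DATE, END_DATE, your_table and date_column are placeholder names you would have to adapt.

input {
    jdbc {
        # same connection and driver settings as in your original config
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/jmdb"
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        jdbc_driver_library => "${HOME}/postgresql-42.2.8.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        jdbc_paging_enabled => true
        jdbc_page_size => 10000
        # run e.g.: START_DATE=2019-01-01 END_DATE=2019-01-31 bin/logstash -f decision.conf
        statement => "SELECT * FROM your_table WHERE date_column BETWEEN '${START_DATE}' AND '${END_DATE}'"
        type => "decision"
    }
}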

Thanks for your help.
But what if I don't have a date column in my table? Can I use something else, like an id column?

Yes, you can try something like

id_column > 123454

and move that lower bound up on each run, the same way as with the date ranges.
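
If you would rather not manage the id ranges by hand, the jdbc input can also do the bookkeeping itself with use_column_value / tracking_column and the built-in :sql_last_value placeholder, which holds the highest value seen on the previous run (0 the first time). Just a rough sketch; decision_id and your_table are placeholder names:

input {
    jdbc {
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/jmdb"
        jdbc_user => "xxx"
        jdbc_password => "xxx"
        jdbc_driver_library => "${HOME}/postgresql-42.2.8.jar"
        jdbc_driver_class => "org.postgresql.Driver"
        jdbc_paging_enabled => true
        jdbc_page_size => 10000
        # let Logstash remember the last id it has indexed
        use_column_value => true
        tracking_column => "decision_id"
        tracking_column_type => "numeric"
        statement => "SELECT * FROM your_table WHERE decision_id > :sql_last_value ORDER BY decision_id"
        type => "decision"
    }
}

The last value is kept in a small metadata file (configurable via last_run_metadata_path), so the next run picks up where the previous one stopped.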

I applied your method using schedule and it works. Thank you.
But I'm wondering whether there is another way besides schedule, because I want Logstash to shut down once all the work is done, and with schedule it won't.

I had a similar requirement, and here is what I did:
I created a bash script on one of the ELK nodes where I don't run Logstash as a daemon, and I run that script via cron, so it runs every other hour and shuts down when it finishes.

#cat test.bash
/usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/my_test.conf

and in this my_test.conf there is no schedule option, so it runs right away and Logstash shuts down after it finishes.
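
For completeness, the cron entry that triggers it could look something like the line below; the script path is a placeholder and every other hour is just the interval I happen to use.

# crontab -e
0 */2 * * * /bin/bash /path/to/test.bash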
