Hi,
I configured a pipeline to transfer complex nested JSON documents from Postgres to Elasticsearch via Logstash. This is the configuration file:
file.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_password => "mypassword"
    statement => "SELECT document::text FROM snapshots"
    # scheduled every 2 hours
    schedule => "0 */2 * * *"
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
  }
}

filter {
  fingerprint {
    source => "document"
    target => "[@metadata][fingerprint]"
    method => "MD5"
  }
  json {
    source => "document"
    remove_field => ["document"]
  }
}

output {
  elasticsearch {
    index => "snapshots"
    #document_id => "%{document->>'teorema_msd_id'}"
    document_id => "%{[@metadata][fingerprint]}"
    hosts => ["localhost"]
  }
}
Everything worked initially, but I noticed several times that at each scheduled run the storage size increased as if the data were being duplicated (in reality each run only added new documents, without duplicates), and after a few seconds it went back to the normal size. This has now become a real problem: I have run out of space on the server, which caused an internal error in the system.
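For reference, this is how I have been watching the index while the pipeline runs (the _cat/indices API is standard Elasticsearch; "snapshots" is the index from the config above):

curl "localhost:9200/_cat/indices/snapshots?v&h=index,docs.count,docs.deleted,store.size"

During a run, store.size grows well beyond the actual size of the documents, then shrinks back after a few seconds.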
Why does the storage size increase as if all the data in Postgres were being loaded again on every run, and how do I solve this?
I imagine I have to tell Logstash to load only the new rows, but how do I give it an offset to start from?
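What I have in mind is something like the sketch below, using :sql_last_value with a tracking column so that each run selects only rows newer than the last one seen. I am assuming here that snapshots has a column marking new rows (created_at is a hypothetical name; the real column may differ), and I have left out the paging options for clarity:

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_password => "mypassword"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # created_at is assumed; any monotonically increasing timestamp
    # or id column on snapshots could serve as the tracking column
    statement => "SELECT document::text, created_at FROM snapshots WHERE created_at > :sql_last_value ORDER BY created_at"
    use_column_value => true
    tracking_column => "created_at"
    tracking_column_type => "timestamp"
    # Logstash persists the last seen value here between runs
    last_run_metadata_path => "/etc/logstash/.snapshots_jdbc_last_run"
    schedule => "0 */2 * * *"
  }
}

(The created_at field would then also need a remove_field in the filter so it doesn't end up in the indexed document.) Is this the right direction, or is there a better way to avoid re-reading the whole table on every run?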