Storage size problem when inserting complex nested JSON documents from Postgres to Elasticsearch via Logstash

Hi,
I configured a pipeline to transfer complex nested JSON documents from Postgres to Elasticsearch via Logstash. This is the configuration file:

file.conf

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_password => "mypassword"
    statement => "SELECT document::text from snapshots"
    # scheduled every 2 hours
    schedule => "0 */2 * * *"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"
  }
}

filter {
  fingerprint {
    source => "document"
    target => "[@metadata][fingerprint]"
    method => "MD5"
  }

  json {
    source => "document"
    remove_field => ["document"]
  }
}

output {
  elasticsearch {
    index => "snapshots"
    # document_id => "%{document->>'teorema_msd_id'}"
    document_id => "%{[@metadata][fingerprint]}"
    hosts => ["localhost"]
  }
}

Everything worked initially, but I noticed that at each run the storage size increased as if the data were being duplicated (in reality only new documents were added, without duplicates), and after a few seconds it returned to the normal size. This has now become a problem: I have run out of space on the server, which caused an internal error in the system.
Why does the storage size grow as if all of the data in Postgres were loaded every time? How do I solve this?
I imagine I have to tell it to load only the new rows, but how can I give it an offset to start from?

Your configuration does load all documents every time, because you are not using a WHERE clause together with the sql_last_value parameter. Since segments are immutable, all of that data is written to disk before the old, replaced documents are deleted, which explains why your disk usage fluctuates.
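As a minimal sketch of an incremental query, assuming the snapshots table has a monotonically increasing column (the id column below is hypothetical; a timestamp column would work the same way with tracking_column_type => "timestamp"), you could track it with sql_last_value:

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_password => "mypassword"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # "id" is a placeholder for whatever ever-increasing column your table has
    statement => "SELECT id, document::text FROM snapshots WHERE id > :sql_last_value ORDER BY id"
    # track the highest id seen so far and reuse it as :sql_last_value on the next run
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    last_run_metadata_path => "/etc/logstash/.snapshots_jdbc_last_run"
    schedule => "0 */2 * * *"
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
  }
}

Logstash persists the last value of the tracking column in last_run_metadata_path, so each scheduled run only selects rows added since the previous one; your filter and output blocks can stay as they are.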


I realized that this is obviously the bottleneck.
