Hi,
I configured a pipeline to transfer complex nested JSON documents from Postgres to Elasticsearch via Logstash. This is the configuration file:
file.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    jdbc_password => "mypassword"
    statement => "SELECT document::text FROM snapshots"
    # scheduled every 2 hours
    schedule => "0 */2 * * *"
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
  }
}

filter {
  fingerprint {
    source => "document"
    target => "[@metadata][fingerprint]"
    method => "MD5"
  }
  json {
    source => "document"
    remove_field => ["document"]
  }
}

output {
  elasticsearch {
    index => "snapshots"
    #document_id => "%{document->>'teorema_msd_id'}"
    document_id => "%{[@metadata][fingerprint]}"
    hosts => ["localhost"]
  }
}
Everything worked initially, but I noticed several times that at each scheduled run the storage size increased as if the data were being duplicated (in reality each run only added new documents, without duplicates), and after a few seconds it went back to the normal size. This has now become a real problem: I have run out of space on the server, which caused an internal error in the system.
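For reference, this is how I have been watching the index while the pipeline runs (the _cat/indices API is standard Elasticsearch; "snapshots" is the index from the config above):

curl "localhost:9200/_cat/indices/snapshots?v&h=index,docs.count,docs.deleted,store.size"

During a run, store.size grows well beyond the actual size of the documents, then shrinks back after a few seconds.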
Why does the storage size increase as if all the data in Postgres were being loaded again on every run, and how do I solve this?
I imagine I have to tell Logstash to load only the new rows, but how do I give it an offset to start from?
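What I have in mind is something like the sketch below, using :sql_last_value with a tracking column so that each run selects only rows newer than the last one seen. I am assuming here that snapshots has a column marking new rows (created_at is a hypothetical name; the real column may differ), and I have left out the paging options for clarity:

input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
    jdbc_user => "myuser"
    jdbc_password => "mypassword"
    jdbc_driver_library => "/etc/logstash/postgresql-42.2.5.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # created_at is assumed; any monotonically increasing timestamp
    # or id column on snapshots could serve as the tracking column
    statement => "SELECT document::text, created_at FROM snapshots WHERE created_at > :sql_last_value ORDER BY created_at"
    use_column_value => true
    tracking_column => "created_at"
    tracking_column_type => "timestamp"
    # Logstash persists the last seen value here between runs
    last_run_metadata_path => "/etc/logstash/.snapshots_jdbc_last_run"
    schedule => "0 */2 * * *"
  }
}

(The created_at field would then also need a remove_field in the filter so it doesn't end up in the indexed document.) Is this the right direction, or is there a better way to avoid re-reading the whole table on every run?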