Logstash produces duplicates


(Vlad Miller) #1

Hello,

The goal is to import an existing MySQL table of about 2 million records into an ES index. However, after a while the ES index contains much more data than that.

I also tried generating a unique SHA-1 fingerprint of each message and using it as the document_id to avoid duplicates.

However, even though the original MySQL table has 2M records, the new ES index ends up with many more documents after a while.

What could be the problem, and how can I fix it?

Here is my config:

input {
  jdbc {
    jdbc_driver_library => "/app/bin/mysql-connector-java-5.1.37-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://testdatabase.xxxxxxxx.us-west-2.rds.amazonaws.com:3306/test"
    jdbc_page_size => 25000
    jdbc_paging_enabled => true
    statement => "SELECT * FROM Table"
  }
}

filter {
  ruby {
    code => "
      require 'digest/sha1';
      event['fingerprint'] = Digest::SHA1.hexdigest(event.to_json);
    "
  }
}

output {
  elasticsearch {
    hosts => ["host:80"]
    index => "fcblive"
    document_type => "action"
    document_id => "%{fingerprint}"
  }
}


(Magnus Bäck) #2

Are you ever restarting Logstash or does it produce duplicates even with a single Logstash execution?

Exactly what does an event look like? If Logstash adds the @timestamp field with the current time when a record is read from the database, the SHA-1 digest will be different every time a particular record is processed.
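To illustrate, here is a standalone Ruby sketch (with a simulated row and hypothetical timestamp values, not actual Logstash events) showing why hashing the whole event yields a new digest each time, while hashing only the stable database columns does not:

```ruby
require 'digest/sha1'
require 'json'

# Simulated database row: stable columns only
row = { 'id' => 42, 'name' => 'example' }

# What the ruby filter effectively hashes: the whole event,
# including the @timestamp Logstash adds at read time.
event1 = row.merge('@timestamp' => '2016-01-01T00:00:00.000Z')
event2 = row.merge('@timestamp' => '2016-01-01T00:00:01.000Z')

with_ts_1 = Digest::SHA1.hexdigest(event1.to_json)
with_ts_2 = Digest::SHA1.hexdigest(event2.to_json)
puts with_ts_1 == with_ts_2   # false: same row, different document IDs

# Hashing only the stable columns gives a stable ID
stable_1 = Digest::SHA1.hexdigest(row.to_json)
stable_2 = Digest::SHA1.hexdigest(row.to_json)
puts stable_1 == stable_2     # true
```

So with the config above, re-reading the same row would index it under a fresh document_id instead of overwriting the previous copy.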


(Vlad Miller) #3

It produces duplicates within a single Logstash run.

I see your point: my fingerprint can't work as expected, because it includes the @timestamp field. That still doesn't answer why Logstash would insert duplicates, though.

I run the command with nohup, so if it fails it shouldn't be restarted.
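For reference, a minimal sketch of a stable fingerprint, assuming the table has a unique primary-key column (hypothetically named id here) and the same event API as the config above — hash only the stable database columns rather than the whole event:

```
filter {
  ruby {
    code => "
      require 'digest/sha1'
      # Hash only a stable column, not the whole event (which includes @timestamp)
      event['fingerprint'] = Digest::SHA1.hexdigest(event['id'].to_s)
    "
  }
}
```

If the table really does have a unique id column, it could even be used directly as the document_id, skipping the hash entirely.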


(system) #4