Fingerprint issue


(Khoa Nguyen) #1

My CSV file has 26041 unique lines (I checked by manually generating a sha1sum for each line). However, when using the fingerprint as the document_id, Elasticsearch only stores 25919 documents. One document has the wrong _id:

_id: %{fingerprint} _type:logs _index:logstash-2015.07.14

This may be what caused 122 records to fail to load. Below is an excerpt of my config:

input { stdin {} }

filter {
    csv { ... }
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        document_id => "%{fingerprint}"
    }
}

If I disable the fingerprint and let Elasticsearch auto-generate the _id, all 26041 records are loaded.

Any ideas on how to debug/troubleshoot this?


(Christian Dahlqvist) #2

Calculate the fingerprint as you do now, but index into Elasticsearch using auto-generated ids. You should then be able to identify the entries that result in duplicate fingerprints, which should allow you to troubleshoot the issue.
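That could look something like the following: keep the fingerprint filter as-is, but drop document_id from the output so Elasticsearch assigns its own _id (a sketch based on the config above; the csv options are elided as in the original):

input { stdin {} }

filter {
    csv { ... }
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        # no document_id: Elasticsearch auto-generates the _id, so records
        # with duplicate fingerprints are stored as separate documents
    }
}

Duplicate fingerprint values can then be located with, for example, a terms aggregation on the fingerprint field.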


(Khoa Nguyen) #3

I've been busy, but I finally got back to this. The issue was my configuration -- it turned out that I had some if statements that messed things up. Below is a simplified version of my config file:

input { stdin {} }

filter {
    if [message] =~ /^RECORD_A.*/ {
        fingerprint {
            method => "SHA1"
            key => "mykey"
        }
    }
}

output { 
    elasticsearch { document_id => "%{fingerprint}" }
}

So if the input record is NOT a RECORD_A, the fingerprint is never computed, the %{fingerprint} reference cannot be resolved, and document_id takes the string "%{fingerprint}" verbatim. Since all of those records then share that one _id, each overwrites the previous, which explains the missing documents.
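In case it is useful to anyone else, a conditional in the output avoids the problem by only using the fingerprint as the _id when it was actually computed (a sketch, assuming the same simplified config as above):

output {
    if [fingerprint] {
        # fingerprint was computed: use it as the document id
        elasticsearch { document_id => "%{fingerprint}" }
    } else {
        # no fingerprint: fall back to Elasticsearch's auto-generated _id
        elasticsearch { }
    }
}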

