Fingerprint issue

My CSV file has 26041 unique lines (I verified this by generating a sha1sum for each line). However, when I use the fingerprint as the document_id, Elasticsearch stores only 25919 documents. One document has the wrong _id:

_id: %{fingerprint} _type:logs _index:logstash-2015.07.14

This may be why 122 records failed to load. Below is an excerpt of my config:

input { stdin {} }

filter {
    csv { ... }
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        document_id => "%{fingerprint}"
    }
}

If I disable the fingerprint and use the auto-generated _id instead, all 26041 records are loaded.

Any ideas on how to debug/troubleshoot this?

Calculate the fingerprint as you do now, but insert into Elasticsearch using auto_id. You should then be able to identify the entries that result in duplicates, which should allow you to troubleshoot the issue.
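As a rough sketch of that suggestion, based on the config excerpt above: keep the fingerprint filter, but drop document_id from the output so that Elasticsearch assigns its own _id. This assumes the filter writes to its default target field, fingerprint; the csv settings stay elided as in the original.

```
input { stdin {} }

filter {
    csv { ... }
    # Still compute the fingerprint; by default it is stored in the "fingerprint" field
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        # document_id omitted, so Elasticsearch generates a unique _id per event
    }
}
```

With all 26041 events indexed, any fingerprint value that appears on more than one document, or any document with no fingerprint field at all, points to the records that collide.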

I've been busy, but I can finally get back to this. The issue was my configuration: it turned out that I had some if statements that messed things up. Below is a simplified version of my config file:

input { stdin {} }

filter {
    if [message] =~ /^RECORD_A.*/  {
        fingerprint {
            method => "SHA1"
            key => "mykey"
        }
    }
}

output { 
    elasticsearch { document_id => "%{fingerprint}" }
}

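So if an input record is not a RECORD_A, the fingerprint is never computed, and document_id takes the literal string "%{fingerprint}". Every such record is then written to that same _id, each one overwriting the last, which presumably accounts for the 122 missing records (123 records collapsing into a single document).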
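One possible way to guard against this, sketched against the simplified config above, is to set document_id only when the fingerprint field actually exists and to fall back to an auto-generated _id otherwise. Whether the non-RECORD_A events should be indexed at all is an assumption here; computing the fingerprint unconditionally would be another option.

```
output {
    if [fingerprint] {
        # Fingerprint was computed, so use it as the document _id
        elasticsearch { document_id => "%{fingerprint}" }
    } else {
        # No fingerprint on this event; let Elasticsearch assign the _id
        elasticsearch { }
    }
}
```

With this in place, events that skip the fingerprint filter no longer share the literal "%{fingerprint}" _id and so no longer overwrite each other.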