Fingerprint issue


(Khoa Nguyen) #1

My CSV file has 26041 unique lines (I checked by manually generating a sha1sum for each line). However, when using the fingerprint as the document_id, Elasticsearch only stores 25919 documents. One document has the wrong _id:

_id: %{fingerprint} _type:logs _index:logstash-2015.07.14

This may be what caused 122 records to fail to load. Below is an excerpt of my config:

input { stdin {} }

filter {
    csv { ... }
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        document_id => "%{fingerprint}"
    }
}

If I disable the fingerprint and let Elasticsearch auto-generate the _id, all 26041 records are loaded.

Any ideas on how to debug/troubleshoot this?


(Christian Dahlqvist) #2

Calculate the fingerprint as you do now, but index into Elasticsearch using auto-generated ids. You should then be able to identify the entries that result in duplicate fingerprints, which should allow you to troubleshoot the issue.
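That could look something like the following: keep the fingerprint filter as-is, but drop document_id from the output so Elasticsearch assigns its own _id (a sketch based on the config above; the csv options are elided as in the original):

input { stdin {} }

filter {
    csv { ... }
    fingerprint { method => "SHA1" key => "RoamMonitor" }
}

output {
    elasticsearch {
        host => localhost
        cluster => "rm_cluster_dev"
        # no document_id: Elasticsearch auto-generates the _id, so records
        # with duplicate fingerprints are stored as separate documents
    }
}

Duplicate fingerprint values can then be located with, for example, a terms aggregation on the fingerprint field.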


(Khoa Nguyen) #3

I've been busy, but I finally got back to this. The issue was my configuration -- it turned out that I had some if statements that messed things up. Below is a simplified version of my config file:

input { stdin {} }

filter {
    if [message] =~ /^RECORD_A.*/ {
        fingerprint {
            method => "SHA1"
            key => "mykey"
        }
    }
}

output { 
    elasticsearch { document_id => "%{fingerprint}" }
}

So if the input record is NOT a RECORD_A, the fingerprint is never computed, the %{fingerprint} reference cannot be resolved, and document_id takes the string "%{fingerprint}" verbatim. Since all of those records then share that one _id, each overwrites the previous, which explains the missing documents.
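In case it is useful to anyone else, a conditional in the output avoids the problem by only using the fingerprint as the _id when it was actually computed (a sketch, assuming the same simplified config as above):

output {
    if [fingerprint] {
        # fingerprint was computed: use it as the document id
        elasticsearch { document_id => "%{fingerprint}" }
    } else {
        # no fingerprint: fall back to Elasticsearch's auto-generated _id
        elasticsearch { }
    }
}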

