How to avoid data duplication in Elasticsearch when data is sent from Logstash?

Hi all,

Logstash & Elasticsearch version - 5.4.3

Steps to reproduce the problem:

  1. Use the below logstash.conf file:

    input {
      file {
        path => ["path of the file"]
        start_position => "beginning"
        ignore_older => 0
      }
    }

    output {
      elasticsearch {
        hosts => "10.0.X.X"
        manage_template => false
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

  2. Start Logstash:

    bin/logstash -f config/logstash

  3. Start Elasticsearch:

    systemctl start elasticsearch

  4. While logs are being transferred from Logstash to Elasticsearch, restart Elasticsearch.

  5. When the data transfer completes, the docs.count value in Elasticsearch (1100) is higher than the number of lines in the input file (1000). (A way to compare the two counts is sketched after this list.)
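For reference, a minimal sketch of how the line count and the document count could be compared; the file path, host, and index pattern are placeholders rather than values from the original post:

    # Count the lines in the input log file (placeholder path)
    wc -l /path/to/input.log

    # Show docs.count per logstash-* index (placeholder host)
    curl -s 'http://10.0.X.X:9200/_cat/indices/logstash-*?v&h=index,docs.count'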

Please suggest how I can avoid data duplication in Elasticsearch.
Regards,
Nikhil Kapoor

Have a look at this blog post, which covers how to avoid duplicates.

Hi @Christian_Dahlqvist,

I have gone through the linked blog post and used the below logstash.conf file:
    input {
      file {
        path => "path of log file"
        start_position => "beginning"
        ignore_older => 0
      }
    }

    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
    }

    output {
      elasticsearch {
        hosts => "10.x.x.x"
        manage_template => false
        document_id => "%{IPORHOST:clientip}"
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

In the normal scenario (without restarting Elasticsearch), when logs are transferred from Logstash to Elasticsearch, the input log file has 5 lines but the docs.count value is only 1.

Can you help me understand how to use document_id correctly to avoid duplication of data?

Did you read the blog post I linked to? Use the fingerprint filter on the message, e.g. with an MD5 or SHA1 hash (MURMUR3 generally has too high a collision risk), and then use this fingerprint as the document id.
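For reference, a minimal sketch of what such a configuration could look like; the key, host, and index name are placeholders rather than values from this thread. (As a side note, %{IPORHOST:clientip} in the earlier output block is a grok pattern, not an event field reference; since no field with that name exists, every event got the same literal string as its document id, which is why docs.count stayed at 1.)

    filter {
      fingerprint {
        # Hash the raw log line so identical lines always produce the same fingerprint
        source => "message"
        target => "[@metadata][fingerprint]"
        method => "SHA1"
        # Arbitrary placeholder key for the keyed hash
        key => "any-static-string"
      }
    }

    output {
      elasticsearch {
        hosts => "10.x.x.x"
        manage_template => false
        # Reuse the fingerprint as the document id so re-sent events overwrite
        # the existing document instead of creating a duplicate
        document_id => "%{[@metadata][fingerprint]}"
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

With a content-based id like this, events that are re-sent after an Elasticsearch restart simply overwrite the existing documents rather than being indexed again.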

Thanks @Christian_Dahlqvist for the replies.
The fingerprint filter solved my problem.

Regards
Nikhil Kapoor
