How to avoid data duplication in Elasticsearch when data is sent from Logstash?

Hi all,

Logstash & Elasticsearch version - 5.4.3

Steps to reproduce the problem:

  1. Use the below logstash.conf file:

    input {
      file {
        path => ["path of the file"]
        start_position => "beginning"
        ignore_older => 0
      }
    }

    output {
      elasticsearch {
        hosts => "10.0.X.X"
        manage_template => false
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

  2. Start Logstash:

    bin/logstash -f config/logstash

  3. Start Elasticsearch:

    systemctl start elasticsearch

  4. While logs are being transferred from Logstash to Elasticsearch, restart Elasticsearch.

  5. When the data transfer completes, the docs.count value in Elasticsearch (1100) is higher than the number of lines in the input file (1000). (A way to compare the two counts is sketched after this list.)
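For reference, a minimal sketch of how the line count and the document count could be compared; the file path, host, and index pattern are placeholders rather than values from the original post:

    # Count the lines in the input log file (placeholder path)
    wc -l /path/to/input.log

    # Show docs.count per logstash-* index (placeholder host)
    curl -s 'http://10.0.X.X:9200/_cat/indices/logstash-*?v&h=index,docs.count'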

Please suggest how I can avoid data duplication in Elasticsearch.
Regards,
Nikhil Kapoor

Have a look at this blog post, which covers how to avoid duplicates.

Hi @Christian_Dahlqvist,

I have gone through the linked blog post and used the below logstash.conf file:
    input {
      file {
        path => "path of log file"
        start_position => "beginning"
        ignore_older => 0
      }
    }

    filter {
      grok {
        match => { "message" => "%{COMBINEDAPACHELOG}" }
      }
    }

    output {
      elasticsearch {
        hosts => "10.x.x.x"
        manage_template => false
        document_id => "%{IPORHOST:clientip}"
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

In the normal scenario (without restarting Elasticsearch), when logs are transferred from Logstash to Elasticsearch, the input log file has 5 lines but the docs.count value is only 1.

Can you help me understand how to use document_id correctly to avoid duplication of data?

Did you read the blog post I linked to? Use the fingerprint filter on the message, e.g. with an MD5 or SHA1 hash (MURMUR3 generally has too high a collision risk), and then use this fingerprint as the document id.
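For reference, a minimal sketch of what such a configuration could look like; the key, host, and index name are placeholders rather than values from this thread. (As a side note, %{IPORHOST:clientip} in the earlier output block is a grok pattern, not an event field reference; since no field with that name exists, every event got the same literal string as its document id, which is why docs.count stayed at 1.)

    filter {
      fingerprint {
        # Hash the raw log line so identical lines always produce the same fingerprint
        source => "message"
        target => "[@metadata][fingerprint]"
        method => "SHA1"
        # Arbitrary placeholder key for the keyed hash
        key => "any-static-string"
      }
    }

    output {
      elasticsearch {
        hosts => "10.x.x.x"
        manage_template => false
        # Reuse the fingerprint as the document id so re-sent events overwrite
        # the existing document instead of creating a duplicate
        document_id => "%{[@metadata][fingerprint]}"
        index => "logstash-%{+YYYY.MM.dd}"
      }
    }

With a content-based id like this, events that are re-sent after an Elasticsearch restart simply overwrite the existing documents rather than being indexed again.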

Thanks @Christian_Dahlqvist for the replies.
The fingerprint filter solved my problem.

Regards
Nikhil Kapoor
