Duplication in elasticsearch

Dinesh_Vijay · May 2, 2018, 6:39am

Hi,

We have a Logstash configuration for our logging application that shows the logs on a Kibana dashboard, but we are getting duplicate entries in Elasticsearch. We tried the workaround described in the blog post below.

Our configuration is as follows:

We have an application which generates 100-150 MB of logs every day, so we stream the logs from the application to RabbitMQ first and, after filtering, put them into Elasticsearch using Logstash configuration files.

Can you suggest how to solve this issue?

Christian_Dahlqvist · May 2, 2018, 6:47am

What does your current configuration look like?

Dinesh_Vijay · May 2, 2018, 6:55am

We have a datacenter where we have deployed a Java application that generates application logs.

We gather these logs in a particular location on a box where we have configured ELK along with RabbitMQ.

From that log location, Logstash pushes the logs to a RabbitMQ queue and from there, after filtering, Logstash pushes them into Elasticsearch.

Christian_Dahlqvist · May 2, 2018, 7:05am

What does your Logstash configuration look like? What is the problem you are seeing?

Dinesh_Vijay · May 2, 2018, 8:02am

Feeder Configuration -

input {
  file {
    path => "/home/US1_application*.log"
    type => "US1_application-log"
    sincedb_path => "/home/jenkinselk/US1_since.db"
    codec => multiline {
      pattern => "^%{YEAR} %{MONTHNUM} %{MONTHDAY} %{TIME}"
      negate => true
      what => previous
    }
  }
}

filter {
  fingerprint {
    target => "generated_id"
    method => "UUID"
  }
}

output {
  if [type] == "US1_application-log" {
    rabbitmq {
      exchange => "UITD-US"
      exchange_type => "topic"
      host => "localhost"
      user =>
      password =>
      durable => "true"
      key => "US1.application"
    }
  }
}

Worker Configuration -

input {
  rabbitmq {
    host => "localhost"
    user =>
    password =>
    exchange => "UITD-US"
    durable => "true"
    queue => "US1_ApplicationLog"
    key => "US1.application"
  }
}

filter {
  if [type] == "US1_application-log" {
    grok {
      match => ["message", "%{YEAR:Year} %{MONTHNUM:Month} %{MONTHDAY:Day} %{TIME:Time}#%{INT}#%{LOGLEVEL:Level}#%{GREEDYDATA:Logger}##%{GREEDYDATA}#na#%{WORD:Tenant}%{GREEDYDATA}#%{WORD:Context}#%{WORD:Transaction}#%{WORD:Connection}#%{WORD:Counter}#%{GREEDYDATA:Text}"]
      add_tag => ["us1_applogs_parsing_successful"]
      tag_on_failure => ["us1_applogs_parsing_failed"]
    }
    date {
      match => [ "timestamp", "yyyy MM dd HH:mm:ss" ]
    }
  }
}

output {
  if "us1_applogs_parsing_failed" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "us1_applogs_failed_index"
      document_type => "US1_Applicationlogs_failed"
      document_id => "%{[generated_id]}"
    }
  }
  else if "us1_applogs_parsing_successful" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "us1_applogs_index"
      document_type => "US1_Applicationlogs"
      document_id => "%{[generated_id]}"
    }
  }
}

Versions Used -

Logstash - 6.2.2
Elasticsearch - 6.2.2
Rabbitmq - 3.6.15
Erlang - 20.2
Kibana - 6.2.2

There seems to be an issue with the Logstash file plugin: it is re-reading lines in the application log. We only need it to read the new lines added to the file. As you can see in the feeder configuration, we are already using the fingerprint plugin, but the issue persists. Please check the above configuration, let us know if anything is used incorrectly, and suggest changes as necessary.

Dinesh_Vijay · May 2, 2018, 8:06am

Also, the application log file is replaced by a file with the same name every 5 minutes.

Christian_Dahlqvist · May 2, 2018, 8:26am

What type of storage are you reading the logs from?

Dinesh_Vijay · May 2, 2018, 8:32am

For the feeder configuration, we read the input application log files from the local filesystem (ext4, Ubuntu); the worker consumes from the RabbitMQ queue.

Christian_Dahlqvist · May 2, 2018, 10:10am

If you are replacing a file, it will appear as a new file to Logstash because the inode is different, even if it has the same name. If the replacement file contains both old and new data, all of it will be reprocessed. Ideally, you should add new data by appending to the file rather than replacing it.

Dinesh_Vijay · May 8, 2018, 10:43am

Hi,
I tested the entire configuration with one input file (generated by a Jenkins job). This file is the input to the Logstash feeder configuration, which sends it to RabbitMQ; the Logstash worker then receives these messages from RabbitMQ and sends them to Elasticsearch.

This is what I observed:
The first time I run the Jenkins job, it creates the input file --> Logstash picks up the contents of this file --> adds the offset to the sincedb and sends the events to RabbitMQ --> the Logstash worker picks them up from RabbitMQ --> sends them to Elasticsearch.

Say I have 3 log entries the first time: I can see 3 log entries in Kibana as well.

When I run the Jenkins job again, it adds, say, 2 more lines to the input file --> Logstash picks up the contents of the input file --> updates the offset in the sincedb and sends the events to RabbitMQ,
but I now see a total of 8 log entries in Kibana.

I can confirm that the inode of the file does not change after the Jenkins job runs and that the sincedb offset is correct, so how can there be multiple entries in Elasticsearch? I suspect RabbitMQ is re-delivering already delivered messages to Elasticsearch. Is this possible?
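
Note on the numbers above: 3 entries from the first run plus 5 from a complete re-read of the now five-line file gives 3 + 5 = 8, so the count is at least consistent with the whole file being read again from the beginning. The thread does not confirm this, and RabbitMQ redelivery is not ruled out. Either way, the fingerprint filter's UUID method generates a new random ID for every event regardless of its content, so a line that is processed twice gets two different document_id values and is indexed twice. A minimal sketch of a content-based ID for the feeder, assuming the fingerprint filter's source, method and key options; the key value is a placeholder and not from the thread:

```
filter {
  fingerprint {
    # Hash the event content instead of generating a random UUID,
    # so the same log entry always maps to the same generated_id.
    source => ["message"]
    target => "generated_id"
    method => "SHA256"
    # Placeholder HMAC key (an assumption); some plugin versions
    # require a key for the SHA methods.
    key => "replace-with-your-own-key"
  }
}
```

With the worker's existing document_id => "%{[generated_id]}", a re-read or re-delivered entry would then overwrite the existing document rather than create a new one. The trade-off is that two genuinely distinct entries with byte-identical text (same timestamp and message) would also collapse into a single document.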

system · June 5, 2018, 10:43am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
