Duplication in Elasticsearch

Hi,

We have a Logstash configuration for our logging application that shows the logs on a Kibana dashboard, but we are getting duplicate entries in Elasticsearch. We tried the workaround described in the blog mentioned below.

Our configuration is as follows:

We have an application that generates 100-150 MB of logs every day. We stream the logs from the application to RabbitMQ first and then, after filtering, write them to Elasticsearch using Logstash configuration files.

Can you suggest how to solve this issue?

What does your current configuration look like?

We have a datacenter where we have deployed a Java application that generates application logs.

We gather these logs in a particular location on a box where we have configured ELK along with RabbitMQ.

From the stored log location, Logstash pushes the logs to a RabbitMQ queue, and from there, after filtering, Logstash pushes the logs into Elasticsearch.

What does your Logstash configuration look like? What is the problem you are seeing?

Feeder Configuration -

input {
  file {
    path => "/home/US1_application*.log"
    type => "US1_application-log"
    sincedb_path => "/home/jenkinselk/US1_since.db"
    codec => multiline {
      pattern => "^%{YEAR} %{MONTHNUM} %{MONTHDAY} %{TIME}"
      negate => true
      what => previous
    }
  }
}

filter {
  fingerprint {
    target => "generated_id"
    method => "UUID"
  }
}

output {
  if [type] == "US1_application-log" {
    rabbitmq {
      exchange => "UITD-US"
      exchange_type => "topic"
      host => "localhost"
      user =>
      password =>
      durable => "true"
      key => "US1.application"
    }
  }
}

Worker Configuration -

input {
  rabbitmq {
    host => "localhost"
    user =>
    password =>
    exchange => "UITD-US"
    durable => "true"
    queue => "US1_ApplicationLog"
    key => "US1.application"
  }
}

filter {
  if [type] == "US1_application-log" {
    grok {
      match => ["message", "%{YEAR:Year} %{MONTHNUM:Month} %{MONTHDAY:Day} %{TIME:Time}#%{INT}#%{LOGLEVEL:Level}#%{GREEDYDATA:Logger}##%{GREEDYDATA}#na#%{WORD:Tenant}%{GREEDYDATA}#%{WORD:Context}#%{WORD:Transaction}#%{WORD:Connection}#%{WORD:Counter}#%{GREEDYDATA:Text}"]
      add_tag => ["us1_applogs_parsing_successful"]
      tag_on_failure => ["us1_applogs_parsing_failed"]
    }
    date {
      match => [ "timestamp", "yyyy MM dd HH:mm:ss" ]
    }
  }
}

output {
  if "us1_applogs_parsing_failed" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "us1_applogs_failed_index"
      document_type => "US1_Applicationlogs_failed"
      document_id => "%{[generated_id]}"
    }
  }
  else if "us1_applogs_parsing_successful" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "us1_applogs_index"
      document_type => "US1_Applicationlogs"
      document_id => "%{[generated_id]}"
    }
  }
}

Versions Used -

Logstash - 6.2.2
Elasticsearch - 6.2.2
RabbitMQ - 3.6.15
Erlang - 20.2
Kibana - 6.2.2

There seems to be an issue with the Logstash file plugin: it is re-reading lines in the application log. We only need it to read the changes (the delta) in the file. As you can see in the feeder configuration, we are already using the fingerprint plugin, but the issue persists. Please check the above configuration, let us know if anything is used incorrectly, and suggest changes as necessary.

Also, the application log file is replaced with a file of the same name every 5 minutes.

What type of storage are you reading the logs from?

For the feeder configuration, we read the input application log files from the local filesystem (ext4 on Ubuntu); the worker consumes from the RabbitMQ queue.

If you are replacing a file, it will appear as a new file to Logstash because the inode is different, even if it has the same name. If the replacement file contains both old and new data, all of it will be reprocessed. Ideally, you should add new data by appending to the file rather than replacing it.
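
A related point about the feeder configuration: `method => "UUID"` in the fingerprint filter generates a new random ID for every event, so a line that is read a second time gets a different `generated_id` and is indexed as a new document. A minimal sketch of a content-based fingerprint is shown here, assuming each log line (timestamp plus text) is unique. The `key` value is a placeholder, and whether `key` is required depends on the fingerprint plugin version.

```
filter {
  fingerprint {
    # Hash the event content so a re-read line yields the same ID.
    source => "message"
    target => "generated_id"
    method => "SHA256"
    # Placeholder value; some plugin versions require a key for SHA methods.
    key => "any-fixed-string"
  }
}
```

With a stable `generated_id`, the worker's existing `document_id => "%{[generated_id]}"` setting would overwrite the earlier document instead of creating a second one. The trade-off is that two genuinely identical lines would also collapse into a single document.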

Hi,
I tested the entire configuration with one input file (generated by a Jenkins job). This file is the input to the Logstash feeder configuration, which sends it to RabbitMQ; the Logstash worker then receives these messages from RabbitMQ and sends them to Elasticsearch.

This is what I observed:
The first time I run the Jenkins job, it creates the input file --> Logstash picks up the contents of this file --> adds the offset to the sincedb and sends the events to RabbitMQ --> the Logstash worker picks them up from RabbitMQ --> sends them to Elasticsearch

Say I have 3 log lines the first time; I can see 3 logs in Kibana as well.

When I run the Jenkins job again, it adds, say, 2 more lines to the input file --> Logstash picks up the contents of the input file --> updates the offset in the sincedb and sends the events to RabbitMQ,
but I now see a total of 8 logs in Kibana.

I can confirm that the inode of the file does not change after the Jenkins job runs and that the sincedb offset is correct, so how can there be multiple entries in Elasticsearch? I suspect RabbitMQ is re-delivering already delivered messages to Elasticsearch. Is this possible?
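
One way to separate the two suspects (the file input re-reading the file versus RabbitMQ re-delivering messages) is to temporarily print what the feeder emits. The sketch below is a suggested test, not part of the original setup: it adds a `stdout` output with the `rubydebug` codec to the feeder pipeline. Note that 8 = 3 + 5, which would be consistent with the whole 5-line file being read again on the second run and each line receiving a fresh random ID.

```
output {
  # Temporary debugging output: prints every event the feeder emits.
  stdout { codec => rubydebug }
}
```

If roughly 5 events are printed on the second run, the re-read happens at the file input, before RabbitMQ is involved; if only 2 are printed, the duplicates arise later in the pipeline.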
