Duplicate events with filebeat -> logstash -> elasticsearch pipeline

an.secure · October 26, 2017, 7:01am

I have the pipeline below, which parses the log files written by an application and sends them to Elasticsearch:

filebeat -> logstash -> elasticsearch

My Elasticsearch node went down for a while. When I restarted Elasticsearch after about 30 minutes, I saw multiple entries for some log messages in Kibana. This happened for 30 log messages in my case. What could be the reason?

Filebeat config:
filebeat.prospectors:
- input_type: log
  paths:
    - C:\Users\log.json
    - C:\Users\arch_*.json
  document_type: json
  json.keys_under_root: true
  json.add_error_key: true
  json.overwrite_keys: true

  fields_under_root: true
  close_inactive: 72h
  clean_inactive: 72h
  ignore_older: 48h

output.logstash:
  hosts: ["localhost:5044"]

Logstash config:
input {
  beats {
    port => 5044
  }
}

filter {
  ruby {
    code => "event.set('read_timestamp', Time.new)"
  }
  mutate {
    remove_field => ["beat", "tags", "input_type"]
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "filebeat-logstash-%{+YYYY.MM.dd}"
  }
}

an.secure · October 30, 2017, 7:40am

Any pointers here would be very helpful.
I have captured the debug logs from Logstash. Please let me know how I can post the logs for further analysis. They are very large, so I don't want to paste them here.

paz · October 30, 2017, 2:35pm

Elasticsearch decides document uniqueness by the document's _id. By default, if none is provided, Elasticsearch auto-generates one when it receives an index request. So if you send the same document multiple times, each copy is indexed as a separate document.

So this is probably happening because you do not set a unique _id yourself. To avoid it, create a unique ID for each document and supply it to the elasticsearch output.

As far as I know, ID generation is not yet supported in Beats (see https://github.com/elastic/beats/issues/5269), but you can probably use Logstash's UUID filter or construct your own ID.
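
To make the "construct your own" option concrete, here is a minimal sketch using Logstash's fingerprint filter together with the document_id option of the elasticsearch output. It assumes the events still carry the source (file path) and offset fields that Filebeat 5.x attaches to each line, and that the application's JSON does not overwrite them. The key value is an arbitrary placeholder, and the choice of fields is an illustration, not something taken from the original config.

```logstash
filter {
  fingerprint {
    # Assumption: "source" and "offset" are the Filebeat-provided fields;
    # together they identify one line of one file.
    source => ["source", "offset"]
    concatenate_sources => true
    method => "SHA1"
    # Placeholder value; older versions of the plugin require a key for SHA methods.
    key => "dedup"
    # @metadata fields are not sent to Elasticsearch as part of the document.
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "filebeat-logstash-%{+YYYY.MM.dd}"
    # Same input line -> same _id, so a resend overwrites instead of duplicating.
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

With a deterministic _id, an event that is sent again overwrites the existing document instead of creating a new one. This only holds within a single index, so a copy that ends up in a different daily index would still appear twice.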

an.secure · October 30, 2017, 4:25pm

@paz Thanks for the suggestion. I will try that out.
I would also like to understand how the pipeline works: why is Logstash, or Filebeat (via Logstash), sending the same event to Elasticsearch multiple times?

paz · October 31, 2017, 2:06pm

I assume it has something to do with Logstash trying to be fault-tolerant.
It is possible that when Elasticsearch went down, it had already indexed the last bulk request (or part of it), but Logstash never received an acknowledgement of success, so it kept retrying that bulk request.
When Elasticsearch eventually came back up, it received the request again and treated it as a new batch of documents (since there was no unique identifier), causing some documents to be indexed twice.

an.secure · October 31, 2017, 4:42pm

In my case I am seeing the same events more than twice. Some events reached Elasticsearch 6 times.

system · November 28, 2017, 4:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
