Duplicate events with filebeat -> logstash -> elasticsearch pipeline

I have the following pipeline to parse the log files written by an application and send them to Elasticsearch:

filebeat -> logstash -> elasticsearch

My Elasticsearch node went down for some time. When I restarted Elasticsearch after around 30 minutes, I observed multiple entries for some log messages in Kibana. This happened for 30 log messages in my case. What could be the reason for this?

Filebeat config:

filebeat.prospectors:
  - input_type: log
    paths:
      - C:\Users\log.json
      - C:\Users\arch_*.json
    document_type: json
    json.keys_under_root: true
    json.add_error_key: true
    json.overwrite_keys: true
    fields_under_root: true
    close_inactive: 72h
    clean_inactive: 72h
    ignore_older: 48h

output.logstash:
  hosts: ["localhost:5044"]

Logstash config:

input {
  beats {
    port => 5044
  }
}

filter {
  ruby {
    code => "event.set('read_timestamp', Time.new)"
  }
  mutate {
    remove_field => ["beat", "tags", "input_type"]
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "filebeat-logstash-%{+YYYY.MM.dd}"
  }
}

Any pointers here would be very helpful.
I have captured the debug logs from Logstash. Please let me know how I can share the file/logs for further analysis; they are too large to paste here.

Elasticsearch decides document uniqueness by the document's _id. By default, if none is provided, Elasticsearch auto-generates one itself when it receives an insert request, so if you send the same document multiple times it will be indexed as separate documents.

So this is probably happening because you do not set a unique _id yourself. To avoid it, generate a unique id for each document and pass it to the elasticsearch output as the document_id.
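
A minimal sketch of one way to do that with Logstash's fingerprint filter (assuming the raw log line is available in a message field; the metadata field name and the MURMUR3 method are just example choices, not something from your config):

filter {
  fingerprint {
    # hash the original log line so the same line always produces the same id
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "MURMUR3"
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "filebeat-logstash-%{+YYYY.MM.dd}"
    # use the fingerprint as the _id so a re-sent event overwrites instead of duplicating
    document_id => "%{[@metadata][fingerprint]}"
  }
}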

As far as I know ID generation is not yet supported in beats components, as per https://github.com/elastic/beats/issues/5269, but you can probably use Logstash's UUID filter or just construct your own.
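
If you would rather generate an arbitrary id in Logstash instead of hashing the event content, the uuid filter is one option (the metadata field name here is again just an example):

filter {
  uuid {
    # store a random UUID in metadata so it is not indexed as a regular field
    target => "[@metadata][uuid]"
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    document_id => "%{[@metadata][uuid]}"
  }
}

Note that a random UUID only stays stable across retries of the same in-flight Logstash event; if Filebeat re-ships a line, it arrives as a new event with a new UUID, so a content-based fingerprint covers more of the duplicate cases.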

@paz Thanks for the suggestion. I will try that out.
I would also like to understand how the pipeline works: why is Logstash or Filebeat (via Logstash) sending the same event to Elasticsearch multiple times?

I assume it has something to do with Logstash trying to be fault-tolerant.
It is possible that when Elasticsearch went down, it indexed the last bulk request (or part of it), but Logstash never received acknowledgement of a successful operation, so it kept retrying that bulk request.
When Elasticsearch eventually came back up, it received the request again but treated it as a new batch of documents (since there was no unique identifier), causing some documents to be indexed twice.

In my case I am seeing the events multiple times, not just twice; some events even reached Elasticsearch 6 times.
