Filebeat, Logstash, Elasticsearch robustness and duplicated documents

For logging in AWS EC2, I'm testing the robustness of the chain Filebeat, Logstash, Elasticsearch. I have one AMI with an application + Filebeat, one with Logstash and one with Elasticsearch + Kibana. With the application running, I reboot one of these 3 machines and see what happens when it's back available.

The good news is that I never lose any line of log. The less good news is that most of the time I end up having duplicated logs in Elasticsearch. Typically I generate X lines (let's say 100K) and I get X + a few hundred in ES.

Why does this happen? Is there a way to prevent it (or to remove the duplicates afterwards), knowing that my log lines have no unique id?

Filebeat has "at-least-once" semantics. Depending on your configuration, at any given time could have X log events in flight that have not been acknowledged. If you hard kill some part of the system then those log events that were in flight will go unacknowledged. So when the system comes back online it has no choice but to resend the un-acknowledged log lines so that no data is lost.

One easy fix for duplicate lines is to give each line a unique id in Logstash, e.g. based on the timestamp and a hash of the line, and use it as the Elasticsearch document id. The drawback is that duplicate lines with the same timestamp will appear only once in the index (it effectively acts like deduplication).
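A minimal sketch of that idea, assuming the Logstash fingerprint filter is installed and using placeholder host names (untested, to be adapted to your pipeline):

# Hypothetical addition to the Logstash pipeline: build a deterministic id from
# timestamp + line, then use it as the Elasticsearch document id so a resent
# event overwrites itself instead of being indexed twice.
filter {
  fingerprint {
    source => ["@timestamp", "message"]
    concatenate_sources => true
    method => "SHA1"
    key => "dedup"                                # any static key
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]               # placeholder host
    document_id => "%{[@metadata][fingerprint]}"  # same line => same _id
  }
}

With document_id set this way, a resend from Filebeat just overwrites the existing document, and lines that are byte-identical with the same timestamp collapse into a single document, which is the deduplication side effect mentioned above.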

Thank you for your answer. There is another phenomenon that I've observed and I'm still a little puzzled about.
As said, I have my chain Filebeat, Logstash, Elasticsearch. With logs coming in, I reboot the ES machine. When it's back available I have a few duplicates (a few hundred out of more than 100K lines). Just after that, I reboot the FB machine. When it's back, it inserts tens of thousands of old duplicates.
I've tried this many times and it seems to be systematic. For completeness, I forgot to mention that there's an Elastic Load Balancer between FB and LS.

FB uses a registry file to remember the state of the last lines sent. This registry file is rewritten after an ACK has been received from Logstash (the request to update the registry file is forwarded to another worker). The registry must be written in order, so if load balancing is enabled in Filebeat (as you are using a load balancer), events that have already been ACKed might still pile up in Filebeat because a slower Logstash has not ACKed its events yet, which delays the registry update.

Can you share your Filebeat config + more details about your setup? Have you tried running the tests without the Elastic Load Balancer?

Here is my Filebeat config:

filebeat:
    prospectors:
    -
        paths:
            - /var/lib/docker/containers/*/*.log
        input_type: log
        fields:
            application: log-gen
        fields_under_root: true
output:
    logstash:
        enabled: true
        hosts: ["XXX.elb.amazonaws.com:5044"]

And my setup looks like Filebeat -> ELB -> 2x Logstash -> Elasticsearch + Kibana, where all elements run in Docker containers.
The 2 Logstashes have a very basic configuration.
(I know that with such a setup only 1 LS will be working at any time. It's just a test.)
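By "very basic" I mean essentially just a beats input and an elasticsearch output; each LS runs roughly the following (the host name is a placeholder):

# Roughly the pipeline each Logstash instance runs (no filters)
input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch.internal:9200"]
  }
}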

It's running in sync mode, so I'd expect only a few thousand log lines to be reindexed when the Filebeat machine is shut down. Before restarting Filebeat you should check the registry file. Is the file present? Is its content valid JSON? The registry file is updated by first writing the complete registry to a temporary file and then using mv to replace the old registry file with the updated one (this changes the inode).

Check the offset in the registry file (the offset is given in bytes) to locate the position Filebeat will restart from. Using the timestamp + line number at that offset and the timestamp + line number of the last line that was indexed, we can tell how many lines will be reindexed.
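For reference, the registry is just a small JSON file. The exact layout depends on the Filebeat version, but in 1.x it is roughly a map from file path to offset plus OS file state; the path, offset and inode below are made-up values:

{
  "/var/lib/docker/containers/abc123/abc123-json.log": {
    "source": "/var/lib/docker/containers/abc123/abc123-json.log",
    "offset": 1234567,
    "FileStateOS": {
      "inode": 278535,
      "device": 2049
    }
  }
}

If the stored offset is far behind the end of the file when Filebeat restarts, everything from that offset to the end of the file will be sent again.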

Could it be that this was related to the following bug? https://github.com/elastic/beats/pull/1060 Was it rotated files that were duplicated?

I am running a similar configuration with FB --> ELB --> LS/LS --> ELB --> ES/ES/ES, but as you can see I am sending the Logstash output to an Elasticsearch cluster. How do you have your ELB configured (session affinity, session timeout)?

This looks to be exactly the situation I am encountering: since we added load balancing, we have begun receiving duplicate lines from our Filebeat instance across two Logstash servers. Do you know if there is anything I can do to either speed up the rate at which an instance ACKs, or tell instances to wait for each other?

Any errors in the Logstash logs? If Logstash cannot process events (does not ACK), the events might be sent to another instance.