Filebeat, Logstash, Elasticsearch robustness and duplicated documents

For logging in AWS EC2, I'm testing the robustness of the chain Filebeat, Logstash, Elasticsearch. I have one AMI with an application + Filebeat, one with Logstash and one with Elasticsearch + Kibana. With the application running, I reboot one of these 3 machines and see what happens when it's back available.

The good news is that I never lose any line of log. The less good news is that most of the time I end up having duplicated logs in Elasticsearch. Typically I generate X lines (let's say 100K) and I get X + a few hundred in ES.

Why does this happen? Is there a way to prevent it (or to remove the duplicates afterwards), knowing that my log lines have no unique id?

Filebeat has "at-least-once" semantics. Depending on your configuration, at any given time could have X log events in flight that have not been acknowledged. If you hard kill some part of the system then those log events that were in flight will go unacknowledged. So when the system comes back online it has no choice but to resend the un-acknowledged log lines so that no data is lost.

One easy fix for duplicate lines is to give each line a unique id in Logstash, e.g. based on the timestamp and a hash of the line, and use it as the Elasticsearch document id. The drawback is that duplicate lines with the same timestamp will appear only once in the index (it effectively acts like deduplication).
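A minimal sketch of that idea, assuming the Logstash fingerprint filter is installed and using placeholder host names (untested, to be adapted to your pipeline):

# Hypothetical addition to the Logstash pipeline: build a deterministic id from
# timestamp + line, then use it as the Elasticsearch document id so a resent
# event overwrites itself instead of being indexed twice.
filter {
  fingerprint {
    source => ["@timestamp", "message"]
    concatenate_sources => true
    method => "SHA1"
    key => "dedup"                                # any static key
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]               # placeholder host
    document_id => "%{[@metadata][fingerprint]}"  # same line => same _id
  }
}

With document_id set this way, a resend from Filebeat just overwrites the existing document, and lines that are byte-identical with the same timestamp collapse into a single document, which is the deduplication side effect mentioned above.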

Thank you for your answer. There is another phenomenon that I've observed and I'm still a little puzzled about.
As said, I have my chain Filebeat, Logstash, Elasticsearch. With logs coming in, I reboot the ES machine. When it's back available I have a few duplicates (a few hundred out of more than 100K lines). Just after that, I reboot the FB machine. When it's back, it inserts tens of thousands of old duplicates.
I've tried this many times and it seems to be systematic. For completeness, I forgot to mention that there's an Elastic Load Balancer between FB and LS.

FB uses a registry file to remember the state of the last lines sent. This registry file is rewritten after an ACK has been received from Logstash (the request to update the registry file is forwarded to another worker). The registry must be written in order, so if load balancing is enabled in Filebeat (as you are using a load balancer), events that have already been ACKed might still pile up in Filebeat because a slower Logstash has not ACKed its events yet, which delays the registry update.

Can you share your Filebeat config + more details about your setup? Have you tried running the tests without the Elastic Load Balancer?

Here is my Filebeat config:

filebeat:
    prospectors:
    -
        paths:
            - /var/lib/docker/containers/*/*.log
        input_type: log
        fields:
            application: log-gen
        fields_under_root: true
output:
    logstash:
        enabled: true
        hosts: ["XXX.elb.amazonaws.com:5044"]

And my setup looks like Filebeat -> ELB -> 2x Logstash -> Elasticsearch + Kibana, where all elements run in Docker containers.
The 2 Logstashes have a very basic configuration.
(I know that with such a setup only 1 LS will be working at any time. It's just a test.)
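By "very basic" I mean essentially just a beats input and an elasticsearch output; each LS runs roughly the following (the host name is a placeholder):

# Roughly the pipeline each Logstash instance runs (no filters)
input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch.internal:9200"]
  }
}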

It's running in sync mode, so I'd expect only a few thousand log lines to be reindexed when the Filebeat machine is shut down. Before restarting Filebeat you should check the registry file. Is the file present? Is its content valid JSON? The registry file is updated by first writing the complete registry to a temporary file and then using mv to replace the old registry file with the updated one (this changes the inode).

Check the offset in the registry file (the offset is given in bytes) to locate the position Filebeat will restart from. Using the timestamp + line number at that offset and the timestamp + line number of the last line that was indexed, we can tell how many lines will be reindexed.
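For reference, the registry is just a small JSON file. The exact layout depends on the Filebeat version, but in 1.x it is roughly a map from file path to offset plus OS file state; the path, offset and inode below are made-up values:

{
  "/var/lib/docker/containers/abc123/abc123-json.log": {
    "source": "/var/lib/docker/containers/abc123/abc123-json.log",
    "offset": 1234567,
    "FileStateOS": {
      "inode": 278535,
      "device": 2049
    }
  }
}

If the stored offset is far behind the end of the file when Filebeat restarts, everything from that offset to the end of the file will be sent again.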

Could it be that this was related to the following bug? https://github.com/elastic/beats/pull/1060 Was it rotated files that were duplicated?

I am running a similar configuration with FB --> ELB --> LS/LS --> ELB --> ES/ES/ES, but as you can see I am sending the Logstash output to an Elasticsearch cluster. How do you have your ELB configured (session affinity, session timeout)?

This looks to be exactly the situation I am encountering: since we added load balancing, we have begun receiving duplicate lines from our Filebeat instance across two Logstash servers. Do you know if there is anything I can do to either speed up the rate at which an instance ACKs, or tell instances to wait for each other?

Any errors in the Logstash logs? If Logstash cannot process events (does not ACK), the events might be sent to another instance.