Loadbalance duplicating the events using logstash

avinash · September 23, 2016, 6:50am

I have Filebeat configured with loadbalance: true and two Logstash hosts, logstash1 and logstash2, both of which output to Elasticsearch.
The Filebeat configuration for Logstash is as follows:
prospectors:
  -
    paths:
      - /home/testLogs/temp.log
spool_size: 1
publish_async: true

# Logstash as output

logstash:
  hosts: ["10.186.187.44:5044", "10.186.187.6:5044"]
  worker: 1
  loadbalance: true

On both Logstash instances I set the input and output as follows:
input {
  beats {
    port => 5044
    congestion_threshold => 100000
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "ipaddress:9200"
    index => "testidx"
  }
}

However, when I echo some lines into temp.log, I see duplicated documents in Elasticsearch. It looks to me as if both Logstash instances are processing all the lines. On the stdout of both instances I sometimes see, for example, line1 and line2 processed by both Logstash instances, and sometimes line1 processed by logstash1 while line1, line2, line2 are processed by logstash2.
Even with loadbalance=false I still see the duplicated documents, and in stdout I see line1 and line2 processed twice by either logstash1 or logstash2.

So I wanted to know: is my configuration wrong, so that the lines are not distributed across the Logstash instances, or is this expected behavior, or a bug?

steffens · September 29, 2016, 9:42am

Can you share your Filebeat configuration, Filebeat debug logs, and the exact shell commands you use to update your file?

The outputs in Filebeat normally do not duplicate content. But with send-at-least-once semantics in Filebeat, if one output did not ACK the events being published, events might be sent again to the same or another output.

To me it sounds more like Filebeat is resending the complete file (maybe the inode changes). But without Filebeat and Logstash logs I cannot say anything.

avinash · September 30, 2016, 9:55am

Hi Steffens,

It looks like the duplicate docs stopped after my system crashed and I had to restart it. So the problem now seems to be fixed, but I still wonder why I was getting duplicate logs after providing two hosts with loadbalance=true, or even false. I don't have the older Filebeat debug logs from when the duplicate events were occurring. Previously I had restarted Filebeat multiple times and the duplication continued. I hope it will not start duplicating again.

shog · September 30, 2016, 7:46pm

Hi @avinash,

Out of curiosity, have you increased congestion_threshold on your beats input?

avinash · October 3, 2016, 4:35am

Hi @shog,
From the start I used a very high value for congestion_threshold, i.e. congestion_threshold => 100000, and I didn't change it before or after this issue was fixed.

steffens · October 4, 2016, 12:40pm

Beats do not duplicate events, but Beats have to resend events if the output does not ACK them (e.g. because of network failures or timeouts). If logstash/elasticsearch/redis/kafka processed the events without the reply ever being received by Beats, Beats have to resend: without a reply you cannot tell whether the data have been processed or not. This can happen even if just one Logstash host is configured and/or load balancing is disabled. It's due to send-at-least-once semantics. Deduplication must be handled either at the protocol level or potentially in ES (which adds indexing overhead, though) by adding some kind of sequence number/id/key. With the load balancer potentially re-balancing failed send attempts, it cannot be solved in the protocol itself unless rebalancing is disabled (potentially blocking the Beat if one sink becomes unavailable).
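As a rough illustration of the point above, here is a minimal Python sketch (not Beats or Elasticsearch code) of an at-least-once sender whose ACK for one event is lost. The `Sink` class, `event_id` helper, file name, and offsets are all hypothetical stand-ins; the only idea taken from the discussion is that a deterministic per-event key lets the store overwrite a resent event instead of adding a second copy.

```python
import hashlib


def event_id(source: str, offset: int, line: str) -> str:
    """Deterministic ID: the same log line always maps to the same key."""
    raw = f"{source}:{offset}:{line}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


class Sink:
    """Hypothetical stand-in for an index: appends when no ID is given,
    overwrites by key when an ID is given."""

    def __init__(self):
        self.docs_without_id = []
        self.docs_by_id = {}

    def index(self, line, doc_id=None):
        if doc_id is None:
            self.docs_without_id.append(line)
        else:
            self.docs_by_id[doc_id] = line


def send_at_least_once(events, sink, ack_lost_for, use_ids):
    """Send each event; if its ACK was lost, the sender cannot know the
    event was processed, so it sends the event a second time."""
    for offset, line in events:
        attempts = 2 if offset in ack_lost_for else 1
        for _ in range(attempts):
            doc_id = event_id("temp.log", offset, line) if use_ids else None
            sink.index(line, doc_id)


events = [(0, "line1"), (6, "line2")]
ack_lost_for = {6}  # assume the ACK for line2 never reached the sender

plain = Sink()
send_at_least_once(events, plain, ack_lost_for, use_ids=False)

keyed = Sink()
send_at_least_once(events, keyed, ack_lost_for, use_ids=True)

print(len(plain.docs_without_id), len(keyed.docs_by_id))  # 3 2
```

Without IDs the resent line2 is stored twice; with a key derived from the event itself the second write replaces the first. The key here is derived from file name, offset, and content purely for illustration.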

Related issue about generating some UUID per event: https://github.com/elastic/beats/issues/1492

Per-event UUIDs should be used with on-disk persistent queues in order to remove the chance of duplicate events being sent due to a new UUID being generated between Beat restarts: https://github.com/elastic/beats/issues/575

But this is future work and not yet available in Beats.

shog · October 4, 2016, 1:31pm

Hi @steffens,

So, could lowering congestion_threshold on beats input help here?

My understanding is that the logstash output in Filebeat has a timeout of 30 secs by default, but @avinash is also using a beats input in LS with congestion_threshold set to 100k seconds.

steffens · October 4, 2016, 1:51pm

A lower congestion_threshold can worsen the problem, as Logstash might close the connection while a subset of the published events is still being actively processed.

There is still a chance that the timeout in Beats did trigger (30s without a response is very well possible if Logstash is blocked by its outputs) or that we're indeed seeing some network issues. Increasing the timeout to something like 5 or 10 minutes might improve the situation if Logstash regularly has to wait (e.g. for its output not responding due to a GC phase on the remote server).
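To make the timeout reasoning concrete, here is a small Python sketch that counts how many batches a sender would resend for a given timeout. The response delays are made-up numbers, and the model is deliberately simplified: it assumes a batch is resent exactly when its reply takes longer than the timeout, and ignores that the resend itself could also time out.

```python
def resent_batches(response_delays_s, timeout_s):
    """Count batches whose reply arrives after the sender's timeout.
    Under at-least-once semantics each of these would be sent again."""
    return sum(1 for delay in response_delays_s if delay > timeout_s)


# Hypothetical seconds until the receiver replied for five batches,
# e.g. while it was waiting on a slow output.
delays = [2, 45, 12, 90, 31]

print(resent_batches(delays, 30))   # 3
print(resent_batches(delays, 300))  # 0
```

With the 30-second default mentioned above, three of the five batches in this made-up trace would be resent even though the receiver eventually processed them; with a 5-minute timeout none would.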
