Loadbalance duplicating the events using logstash


(Avinash) #1

I have filebeat configured with loadbalance: true and two logstash hosts, logstash1 & logstash2, which both feed elasticsearch.
The filebeat configuration is as below:
prospectors:
  -
    paths:
      - /home/testLogs/temp.log
spool_size: 1
publish_async: true

Logstash as output

logstash:
  hosts: ["10.186.187.44:5044", "10.186.187.6:5044"]
  worker: 1
  loadbalance: true

The configuration on both logstash instances is set as below:
input {
  beats {
    port => 5044
    congestion_threshold => 100000
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "ipaddress:9200"
    index => "testidx"
  }
}

However, when I echo some lines into temp.log, I see duplicated docs in elasticsearch; it looks to me as if both logstash instances are processing all the lines. On the stdout of both logstash instances I sometimes see, for example, line1 & line2 processed by both, and sometimes line1 processed by logstash1 and line1, line2, line2 processed by logstash2.
Even with loadbalance=false I still see the duplicated docs, and in stdout I see line1 & line2 processed twice by either logstash1 or logstash2.

So I just wanted to know: is my configuration wrong, so that the lines are not distributed across the logstash instances, or is this the expected behavior, or a bug?


(Steffen Siering) #2

Can you share your filebeat configuration, the filebeat debug logs, and the exact shell commands you use to update your file?

The outputs in filebeat do not normally duplicate content. But with the send-at-least-once semantics in filebeat, if an output did not ACK the events being published, the events might be sent again to the same or another output.

To me it sounds more like filebeat is resending the complete file (maybe the inode changes). But without the filebeat and logstash logs I cannot say anything.


(Avinash) #3

Hi Steffen,

It looks like the duplicate docs stopped after my system crashed and I had to restart it. So the problem seems to be fixed for now, but I still wonder why I was getting duplicates after providing two hosts with loadbalance=true, or even false. I don't have the older filebeat debug logs from when the duplicate events were occurring. I had previously restarted filebeat multiple times, and the duplication still continued. I hope it will not start duplicating again.


#4

Hi @avinash,

Out of curiosity, have you increased congestion_threshold on your beats input?


(Avinash) #5

Hi @shog,
From the start I have used a very high congestion_threshold value, i.e. congestion_threshold => 100000, and I didn't change it before or after the issue went away.


(Steffen Siering) #6

Beats do not duplicate events, but beats have to resend events if the output is not ACKing them (e.g. due to network failures or timeouts). If logstash/elasticsearch/redis/kafka did process the events but the reply was never received by beats, beats have to resend: without a reply you cannot tell whether the data have been processed or not. This can happen even if just one logstash host is configured and/or load balancing is disabled. It's due to the send-at-least-once semantics. Deduplication must be handled either on the protocol level or potentially in ES (which adds indexing overhead, though) by adding some kind of sequence number/id/key. With the load balancer potentially re-balancing failed send attempts, it cannot be solved on the protocol level alone unless rebalancing is disabled (potentially blocking the beat if one sink becomes unavailable).
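
To make the lost-ACK case concrete, here is a minimal Python sketch. It is not beats or logstash code: `Sink`, `send_at_least_once`, and `event_id` are hypothetical names, and deriving the ID from file path, offset, and line content is just one assumed choice of key. It simulates one sink that processes a batch but whose ACK is lost, so the batch is retried on the second sink.

```python
import hashlib


def event_id(source: str, offset: int, line: str) -> str:
    """Deterministic ID (assumed scheme): same source line -> same key."""
    return hashlib.sha1(f"{source}:{offset}:{line}".encode()).hexdigest()


class Sink:
    """Hypothetical stand-in for one logstash host plus its index."""

    def __init__(self, lose_first_ack: bool = False):
        self.appended = []   # behaviour with auto-generated document IDs
        self.docs = {}       # behaviour with a sender-supplied document ID
        self._lose_first_ack = lose_first_ack

    def publish(self, batch) -> bool:
        for ev in batch:
            self.appended.append(ev)
            self.docs[ev["id"]] = ev  # a resend overwrites, it does not add
        if self._lose_first_ack:
            self._lose_first_ack = False
            return False  # events WERE processed, but the ACK never arrives
        return True


def send_at_least_once(batch, sinks, max_attempts: int = 10) -> int:
    """Retry round-robin over the sinks until one ACKs; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        sink = sinks[(attempt - 1) % len(sinks)]
        if sink.publish(batch):
            return attempt
    raise RuntimeError("no sink ACKed the batch")


def demo():
    lines = ["line1", "line2"]
    batch = [
        {"id": event_id("/home/testLogs/temp.log", i, text), "message": text}
        for i, text in enumerate(lines)  # list index stands in for the offset
    ]
    logstash1, logstash2 = Sink(lose_first_ack=True), Sink()
    attempts = send_at_least_once(batch, [logstash1, logstash2])
    appended = len(logstash1.appended) + len(logstash2.appended)
    unique = len({**logstash1.docs, **logstash2.docs})
    return attempts, appended, unique


if __name__ == "__main__":
    print(demo())  # (2, 4, 2): two attempts, four stored copies, two unique IDs
```

In this toy model both sinks end up having processed line1 and line2, which matches the stdout pattern reported above; only the keyed store collapses the resend back to two documents.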

Related issue about generating some UUID per event: https://github.com/elastic/beats/issues/1492

Per-event UUIDs should be used with on-disk persistent queues in order to remove the chance of duplicate events being sent because a new UUID is generated between beat restarts: https://github.com/elastic/beats/issues/575
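
A rough Python illustration of why the UUID needs to survive a restart. The `Beat` class and its `persisted_ids` dictionary are hypothetical stand-ins (the dictionary models an on-disk queue); none of this exists in beats.

```python
import uuid


class Beat:
    """Hypothetical shipper that tags each not-yet-ACKed line with a UUID."""

    def __init__(self, persisted_ids=None):
        # persisted_ids models state kept on disk; None means memory only.
        self.ids = persisted_ids if persisted_ids is not None else {}

    def tag(self, line_key: str) -> str:
        # Reuse a stored UUID if one exists, otherwise generate a new one.
        if line_key not in self.ids:
            self.ids[line_key] = str(uuid.uuid4())
        return self.ids[line_key]


def ids_seen_across_restart(persistent: bool) -> int:
    """Number of distinct IDs the sink sees for ONE line sent before and after a restart."""
    disk = {}
    before = Beat(disk if persistent else None).tag("temp.log:0")
    # Restart: a new process; only the 'disk' dictionary survives.
    after = Beat(disk if persistent else None).tag("temp.log:0")
    return len({before, after})
```

Under these assumptions `ids_seen_across_restart(False)` yields 2 distinct IDs, so the resent event would look new to the sink, while `ids_seen_across_restart(True)` yields 1.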

But this is future work, not yet available in beats.


#7

Hi @steffens,

So, could lowering congestion_threshold on beats input help here?

My understanding is that logstash output in filebeat has a timeout of 30 secs by default, but @avinash is also using a beats input in LS with congestion_threshold set to 100k seconds.


(Steffen Siering) #8

A lower congestion_threshold can worsen the problem, as logstash might close the connection while a subset of the published events is still being actively processed.

There is still a chance that the timeout in beats did trigger (30s without a response is very well possible if logstash is blocked by its outputs), or that we're indeed seeing some network issues. Increasing the timeout to something like 5 or 10 minutes might improve the situation if logstash regularly has to wait (e.g. for its output not responding due to a GC phase on the remote server).

