[SOLVED] Getting duplicate entries with Filebeat to Logstash setup

Hi,

My setup has been working perfectly for a week now, but today I noticed that an entire logfile got submitted twice (the whole content was written around the same time, so it was probably transmitted as one batch).

The only thing I noticed was the following in the Filebeat log:

2018-11-07T07:45:09.878+0100	INFO	log/harvester.go:251	Harvester started for file: C:\logs\download.log
2018-11-07T07:45:10.879+0100	ERROR	logstash/async.go:256	Failed to publish events caused by: write tcp 10.122.65.225:54214->10.122.65.223:5044: wsasend: An existing connection was forcibly closed by the remote host.
2018-11-07T07:45:11.881+0100	ERROR	pipeline/output.go:121	Failed to publish events: write tcp 10.122.65.225:54214->10.122.65.223:5044: wsasend: An existing connection was forcibly closed by the remote host.
2018-11-07T07:45:11.881+0100	INFO	pipeline/output.go:95	Connecting to backoff(async(tcp://10.122.65.223:5044))
2018-11-07T07:45:11.881+0100	INFO	pipeline/output.go:105	Connection to backoff(async(tcp://10.122.65.223:5044)) established
2018-11-07T07:45:26.983+0100	INFO	[monitoring]	log/log.go:141	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":16146},"total":{"ticks":21949,"value":21949},"user":{"ticks":5803}},"info":{"ephemeral_id":"9a74d02c-78cc-4eff-8223-6200a82df1a7","uptime":{"ms":598530415}},"memstats":{"gc_next":6522208,"memory_alloc":3494712,"memory_total":9663826592,"rss":4136960}},"filebeat":{"events":{"added":23,"done":23},"harvester":{"open_files":2,"running":2,"started":1},"input":{"log":{"files":{"renamed":1}}}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"acked":21,"batches":3,"failed":20,"total":41},"read":{"bytes":12},"write":{"bytes":1165,"errors":1}},"pipeline":{"clients":1,"events":{"active":0,"filtered":2,"published":21,"retry":40,"total":23},"queue":{"acked":21}}},"registrar":{"states":{"current":43,"update":23},"writes":{"success":3,"total":3}}}}}
2018-11-07T07:45:39.902+0100	INFO	log/harvester.go:251	Harvester started for file: C:\logs\production.log

So it would appear that there was some network issue, but the events seem to have been submitted despite that.

Is there any way to prevent this in the future?
I mean, we should find out what the network issue was on our end, but such issues can happen, and I'd hate to end up with duplicated logs every time one does. I may not always notice it, either.

I would not like to use some field from our logs as the document ID in ES, as I don't really see a candidate field, and even if I did, I'd be hesitant to apply deduplication that way.

I hope there is a way to alleviate this issue.

Filebeat config:

filebeat.inputs:
- type: log
  enabled: true

  paths:
    - C:\logs\*\*
  close_inactive: 5h

  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

filebeat.config.modules:

  path: ${path.config}/modules.d/*.yml

  reload.enabled: false

setup.template.settings:
  index.number_of_shards: 3

setup.kibana:

output.logstash:
  hosts: ["10.122.65.223:5044"]

Logstash config:

input {
  beats {
    port => 5044
    host => "0.0.0.0"
  }
}

filter {
  grok {
    match => { "message" => [
      "\A%{TIMESTAMP_ISO8601}\s-\s[a-zA-Z0-9]+\s-\s%{LOGLEVEL:log-level}\s-\sBestelltnummer:\s(?<bestellnummer>[0-9]{9}),\sILN:\s(?<iln>[0-9]{13}),\sKundenNr\. (?<kundennr>[0-9]{6}),\s(?<stueckzahl>[0-9]{1,3})\sST,\sArtNr.:\s(?<artikelnr>[0-9]{13,14})",
      "\A%{TIMESTAMP_ISO8601}\s-\s[a-zA-Z0-9]+\s-\s%{LOGLEVEL:log-level}\s-\s%{DATA}:\s(?<ris-docid>%{GREEDYDATA})",
      "\A%{TIMESTAMP_ISO8601}\s-\s[a-zA-Z0-9]+\s-\s%{LOGLEVEL:log-level}\s-\s"
    ] }
  }
}

filter {
  mutate {
    convert => { "stueckzahl" => "integer" }
  }
}

output {
  elasticsearch {
    hosts => ["myhostinthecloud:9243"]
    #index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    user => xxx
    password => xxx
  }
}

/EDIT:

I did add another config yesterday in logstash/conf.d, but I don't see how that would be causing it:

input {
  gelf {
    port => 12201
    type => gelf
    host => "0.0.0.0"
  }
}

output {
  elasticsearch {
    hosts => ["myhostinthecloud:9243"]
    #index => "%{[@metadata][beat]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
    user => xxx
    password => xxx
  }
}

Filebeat and Logstash offer an at-least-once delivery guarantee, which means that if there are network or cluster issues, data will be resent. You cannot achieve an exactly-once delivery guarantee, but you can prevent duplicates once the data is written to Elasticsearch. Have a look at the Elastic blog posts on handling duplicates for further details.
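For reference, the technique those posts describe is to compute a deterministic document ID in Logstash with the fingerprint filter, so that a resent batch overwrites the documents it already created instead of adding new ones. A minimal sketch, assuming the whole message line is unique enough to fingerprint (the key is just an arbitrary static string of your choosing):

filter {
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "any-static-string"   # turns the hash into a keyed HMAC-SHA1
    base64encode => true
  }
}

output {
  elasticsearch {
    hosts => ["myhostinthecloud:9243"]
    # identical event content => identical _id, so a retry overwrites instead of duplicating
    document_id => "%{[@metadata][fingerprint]}"
    user => xxx
    password => xxx
  }
}

Using [@metadata] keeps the fingerprint out of the stored document itself.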

Thank you!

The first article was really helpful and I added a fingerprint to my config, but then things got even weirder: there were entries with IDs clearly created by the fingerprint, yet there were still duplicate entries with IDs that looked like they were generated by ES.

This made me suspicious, so I renamed the new config I mentioned (the one with the gelf input) to deactivate it. I restarted Logstash, deleted the Filebeat registry, and lo and behold: no more duplicates.
So it seems like the second config is causing this. Any idea how this is possible and what I can do about it?
The logfiles are only processed by Filebeat, and unless that is somehow sending them via gelf too, I have no idea what's going on.

If you have more than one config file in the directory, they are all concatenated, which means that all inputs will be processed by all filters and go to all outputs unless you control the flow using conditionals.
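For example, with the two configs you posted, you could keep the outputs apart by routing on the type field that only the gelf input sets (a rough sketch based on the configs above):

output {
  if [type] == "gelf" {
    # events from the gelf input
    elasticsearch {
      hosts => ["myhostinthecloud:9243"]
      user => xxx
      password => xxx
    }
  } else {
    # everything else, i.e. the beats traffic
    elasticsearch {
      hosts => ["myhostinthecloud:9243"]
      user => xxx
      password => xxx
    }
  }
}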

Okay, but I don't see how that explains my issue. Do you have an idea why the gelf input causes duplicates without even receiving anything?

You have two elasticsearch output plugins, and both will process all data from all input plugins. This happens even if no gelf data is coming in.
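If you would rather not add conditionals, Logstash 6.0 and later can also run the two configs as fully separate pipelines via pipelines.yml, so their inputs, filters, and outputs never mix (the file paths here are assumptions for illustration):

# pipelines.yml: one isolated pipeline per config file
- pipeline.id: beats
  path.config: "/etc/logstash/conf.d/beats.conf"
- pipeline.id: gelf
  path.config: "/etc/logstash/conf.d/gelf.conf"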

Ohhhh, got it. Sorry, I was a little dense there :)
