Duplicate logs - logstash-input-s3?

Hi all,

I'm using LS 2.1.1 with ES 2.1.1, in AWS, to log various sources within my environment.

One of these sources is AWS CloudTrail.

I've noticed a strange problem after trying to optimise my Elasticsearch cluster by reducing the number of indices and shards; more details here.

So I've optimised my cluster by reducing both my total indices and total shards. Good news so far.

My input used for logstash-input-s3:
input {
  s3 {
    bucket => "my-logs-cloudtrail"
    delete => false
    interval => 60 # seconds
    prefix => "AWSLogs/MYACCOUNTID/CloudTrail/"
    type => "cloudtrail"
    codec => "cloudtrail"
    credentials => "/etc/logstash/s3_credentials.ini"
    sincedb_path => "/opt/logstash_cloudtrail/MYACCOUNTID-sincedb"
  }
}

My filter used to parse the incoming logs:
filter {
  if [userIdentity][accountId] =~ "MYACCOUNTID" {
    mutate {
      add_field => [ "accountName", "mylogs-aws-development" ]
    }
  }
}

My output used to send to Elasticsearch:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "wtf-are-these-logs"
    }
  }
  stdout { codec => "rubydebug" }
}

I noticed that my filter is still working as it was before; however, something odd started to happen:

-The document count continues to grow at nearly the same rate.
-I restarted the logging cluster a few times.
-I triple-checked my input, filter, and output.

Then I started to compare events from each index, and noticed:

-Sample events were identical.
-Events in the wtf-are-these-logs index are also labelled type: cloudtrail.
-The frequency of events, displayed in Kibana, is almost identical.

I can't explain this.

Has anyone else seen this before?
Can anyone assist with an explanation or how I can stop this?

Make sure you don't have any extra files in /etc/logstash/conf.d. Logstash will read every single file and effectively concatenate them.
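
As a minimal, purely hypothetical sketch of how that could produce exactly this pattern (the filename 25-old-output.conf and its contents are invented for illustration): an output block with no conditional receives every event, so type cloudtrail events would also land in the second index on top of whatever the routed output does.

  # Hypothetical leftover file: /etc/logstash/conf.d/25-old-output.conf
  # There is no conditional here, so Logstash sends EVERY event to this
  # output as well, which would explain type:cloudtrail events appearing
  # in wtf-are-these-logs.
  output {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "wtf-are-these-logs"
    }
  }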

Thanks @magnusbaeck, I made sure to check that right away.

Does it matter that my input and filter are in the same file, and my output is in a separate file?

The current file structure I use is:
10-inputfile.conf (contains my input and filters)
30-outputfile.conf (contains my output configuration)

I have just now modified this to look like the following (how it was configured some time ago):
10-inputfile.conf (contains my input configuration)
20-filterfile.conf (contains my filter configuration)
30-outputfile.conf (contains my output configuration)

I have considered putting the entire input => filter => output pipeline in the same file, but I am unsure if this makes any difference.

The only thing that matters is the internal order of all filters. Having multiple files in a directory is exactly equivalent to doing cat /etc/logstash/conf.d/* > logstash.conf and pointing Logstash to logstash.conf.
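
If you want to see the effective configuration Logstash is loading, something along these lines should work (paths assume the default /etc/logstash/conf.d directory from the packaged install):

  # List everything in the directory, including stray backups or editor swap files
  ls -la /etc/logstash/conf.d/
  # View the effective concatenated configuration
  cat /etc/logstash/conf.d/*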

In that case, I have no idea why this is happening.

Everything appears to be ticking over nicely, apart from all the duplicate entries.

My concern is that this is increasing disk usage, I/O, etc.

@magnusbaeck I have a question; if you have time, I wanted to ask your opinion before testing.

Is there a difference between these 2?
Would this possibly stop my duplication issue?

Original:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      hosts => "MYESCLUSTER"          <======== REMOVE HERE
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      hosts => "MYESCLUSTER"          <======== REMOVE HERE
      index => "wtf-are-these-logs"
    }
  }
  stdout { codec => "rubydebug" }
}

Revised:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      index => "wtf-are-these-logs"
    }
  }
  elasticsearch {                     <=============== ADD HERE
    hosts => "MYESCLUSTER"            <======== ADD HERE
  }                                   <============================ ADD HERE
  stdout { codec => "rubydebug" }
}

No, this doesn't make sense.

Yeah, sorry, I just literally finished testing it, and it doesn't work.

My thinking is that the output might be registering twice.

You can start Logstash with --debug to see exactly what configuration Logstash loads.
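
For example (the /opt/logstash path is an assumption based on the 2.x package install; adjust for your setup):

  # Check the configuration files for syntax problems
  /opt/logstash/bin/logstash --configtest -f /etc/logstash/conf.d/
  # Run in the foreground with debug logging to see the loaded configuration
  /opt/logstash/bin/logstash --debug -f /etc/logstash/conf.d/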