Duplicate logs - logstash-input-s3?

Hi all,

I'm using LS 2.1.1 with ES 2.1.1, in AWS, to log various sources within my environment.

One of these sources is AWS CloudTrail.

I've noticed a strange problem after trying to optimise my Elasticsearch cluster by reducing the number of indices and shards; more details here.

So I've optimised my cluster by reducing both my total indices and total shards. Good news so far.

My input used for logstash-input-s3:
input {
  s3 {
    bucket => "my-logs-cloudtrail"
    delete => false
    interval => 60 # seconds
    prefix => "AWSLogs/MYACCOUNTID/CloudTrail/"
    type => "cloudtrail"
    codec => "cloudtrail"
    credentials => "/etc/logstash/s3_credentials.ini"
    sincedb_path => "/opt/logstash_cloudtrail/MYACCOUNTID-sincedb"
  }
}

My filter used to parse the incoming logs:
filter {
  if [userIdentity][accountId] =~ "MYACCOUNTID" {
    mutate {
      add_field => [ "accountName", "mylogs-aws-development" ]
    }
  }
}

My output used to send to Elasticsearch:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "wtf-are-these-logs"
    }
  }
  stdout { codec => "rubydebug" }
}

I noticed that my filter is still working as it was before; however, something odd started to happen:

-The document count continues to grow at nearly the same rate.
-I restarted the logging cluster a few times.
-I triple-checked my input, filter, and output.

Then I started to compare events from each index, and noticed:

-Sample events were identical.
-Events in the wtf-are-these-logs index are also labelled type: cloudtrail.
-The frequency of events, displayed in Kibana, is almost identical.

I can't explain this.

Has anyone else seen this before?
Can anyone assist with an explanation or how I can stop this?

Make sure you don't have any extra files in /etc/logstash/conf.d. Logstash will read every single file and effectively concatenate them.
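
As a minimal, purely hypothetical sketch of how that could produce exactly this pattern (the filename 25-old-output.conf and its contents are invented for illustration): an output block with no conditional receives every event, so type cloudtrail events would also land in the second index on top of whatever the routed output does.

  # Hypothetical leftover file: /etc/logstash/conf.d/25-old-output.conf
  # There is no conditional here, so Logstash sends EVERY event to this
  # output as well, which would explain type:cloudtrail events appearing
  # in wtf-are-these-logs.
  output {
    elasticsearch {
      hosts => "MYESCLUSTER"
      index => "wtf-are-these-logs"
    }
  }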

Thanks @magnusbaeck, I made sure to check that right away.

Does it matter that my input and filter are in the same file, and my output is in a separate file?

The current file structure I use is:
10-inputfile.conf (contains my input and filters)
30-outputfile.conf (contains my output configuration)

I have just now modified this to look like the following (how it was configured some time ago):
10-inputfile.conf (contains my input configuration)
20-filterfile.conf (contains my filter configuration)
30-outputfile.conf (contains my output configuration)

I have considered putting the entire input => filter => output pipeline in the same file, but I am unsure if this makes any difference.

The only thing that matters is the internal order of all filters. Having multiple files in a directory is exactly equivalent to doing cat /etc/logstash/conf.d/* > logstash.conf and pointing Logstash to logstash.conf.
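
If you want to see the effective configuration Logstash is loading, something along these lines should work (paths assume the default /etc/logstash/conf.d directory from the packaged install):

  # List everything in the directory, including stray backups or editor swap files
  ls -la /etc/logstash/conf.d/
  # View the effective concatenated configuration
  cat /etc/logstash/conf.d/*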

In that case, I have no idea why this is happening.

Everything appears to be ticking over nicely, apart from all the duplicate entries.

My concern is that this is increasing disk usage, I/O, etc.

@magnusbaeck I have a question; if you have time, I wanted to ask your opinion before testing.

Is there a difference between these 2?
Would this possibly stop my duplication issue?

Original:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      hosts => "MYESCLUSTER"          <======== REMOVE HERE
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      hosts => "MYESCLUSTER"          <======== REMOVE HERE
      index => "wtf-are-these-logs"
    }
  }
  stdout { codec => "rubydebug" }
}

Revised:
output {
  if [type] == "cloudtrail" {
    elasticsearch {
      index => "logstash-cloudtrail"
    }
  } else {
    elasticsearch {
      index => "wtf-are-these-logs"
    }
  }
  elasticsearch {                     <=============== ADD HERE
    hosts => "MYESCLUSTER"            <======== ADD HERE
  }                                   <============================ ADD HERE
  stdout { codec => "rubydebug" }
}

No, this doesn't make sense.

Yeah, sorry, I just literally finished testing it, and it doesn't work.

My thinking is that the output might be registering twice.

You can start Logstash with --debug to see exactly what configuration Logstash loads.
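
For example (the /opt/logstash path is an assumption based on the 2.x package install; adjust for your setup):

  # Check the configuration files for syntax problems
  /opt/logstash/bin/logstash --configtest -f /etc/logstash/conf.d/
  # Run in the foreground with debug logging to see the loaded configuration
  /opt/logstash/bin/logstash --debug -f /etc/logstash/conf.d/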