I am using the Logstash S3 input plugin to read gz files from an S3 bucket and ingest them into Elasticsearch.
Here is the config file under /etc/logstash/conf.d:
input {
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    #role_arn => "arn:aws:iam::333333333:role/dddd-elk-app"
    bucket => "logrepo"
    prefix => "403-forbidden/"
    region => "ap-southeast-1"
    tags => ["s3", "403"]
  }
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    #role_arn => "arn:aws:iam::333333333:role/dddd-elk-app"
    bucket => "logrepo"
    prefix => "access/"
    region => "ap-southeast-1"
    tags => ["s3", "access"]
  }
}

filter {
  mutate {
    add_field => {
      "source_file" => "%{[@metadata][s3][key]}"
    }
  }
  if "403" in [tags] {
    grok {
      match => [
        "message",
        "`%{TIMESTAMP_ISO8601:logTimestamp} %{DATA:ip} %{DATA:session_id} %{GREEDYDATA:error_message}"
      ]
    }
    date {
      match => [ "logTimestamp", "ISO8601" ]
      target => "@timestamp"
      locale => "en"
      timezone => "UTC"
    }
  }
}

output {
  if "403" in [tags] {
    elasticsearch {
      hosts => ["11.11.11.11:9200"]
      index => "403-%{+YYYY.MM.dd}"
    }
  } else if "access" in [tags] {
    elasticsearch {
      hosts => ["11.11.11.11:9200"]
      pipeline => "filebeat-7.0.0-apache-access-default"
      index => "access-%{+YYYY.MM.dd}"
    }
  }
}
The source files in the S3 bucket are gz files, each around 5 MB, with a new one generated roughly every 20-60 seconds.
I found that messages in Elasticsearch are duplicated: a single message is ingested into ES 2-5 times, each copy with a different _id.
The problem does not occur when I upload the gz files to the S3 bucket one by one.
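As a stop-gap while I investigate, I am considering giving each event a deterministic _id, so that a re-read overwrites the existing document instead of creating a duplicate. A sketch using the fingerprint filter plus document_id on the elasticsearch output (not yet tested on my side):

```
filter {
  fingerprint {
    source => ["message"]
    method => "SHA256"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["11.11.11.11:9200"]
    index => "403-%{+YYYY.MM.dd}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

This would only hide the symptom rather than fix the sincedb behaviour, but it would at least confirm that the extra documents are byte-identical re-reads of the same lines.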
When I tail the sincedb, I find that the times written into it are not chronological. Here is an extract from the tail command:
2019-04-24 18:13:21 +0000
2019-04-24 18:13:22 +0000
2019-04-24 18:12:49 +0000
2019-04-24 18:12:50 +0000
2019-04-24 18:13:23 +0000
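To spot the regressions more reliably than eyeballing the tail output, I wrote a quick check in plain Python (nothing Logstash-specific; the sample values below are copied from the tail extract above):

```python
from datetime import datetime

def find_out_of_order(lines):
    """Return (index, previous, current) tuples where a sincedb
    timestamp is older than the one written before it."""
    fmt = "%Y-%m-%d %H:%M:%S %z"
    stamps = [datetime.strptime(line.strip(), fmt)
              for line in lines if line.strip()]
    return [
        (i, stamps[i - 1], stamps[i])
        for i in range(1, len(stamps))
        if stamps[i] < stamps[i - 1]
    ]

# Sample values copied from my tail output.
tail = [
    "2019-04-24 18:13:21 +0000",
    "2019-04-24 18:13:22 +0000",
    "2019-04-24 18:12:49 +0000",
    "2019-04-24 18:12:50 +0000",
    "2019-04-24 18:13:23 +0000",
]

for i, prev, cur in find_out_of_order(tail):
    print(f"time went backwards at entry {i + 1}: {prev} -> {cur}")
```

On the extract above this flags exactly one regression (18:13:22 followed by 18:12:49), which is where I suspect the plugin switched to a file from a different sub-directory.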
I checked and there are subfolders inside s3://logrepo/403-forbidden/, so I suspect:
- The differing timestamps are written into the sincedb because files under different sub-directories are being read.
- Since a "past time" ends up in the sincedb, files that were already ingested look new again and are re-read, which would explain the duplicates.
May I ask for advice on how to troubleshoot this issue?
I did an experiment: when I edit the sincedb to a past date/time and save it, Logstash re-reads the already-ingested gz files because the date/time changed.
So I think the root cause comes down to why the sincedb is being updated with an older timestamp (presumably by Logstash itself).
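One thing I plan to test: the S3 input plugin supports an explicit sincedb_path option, so each input can keep its own state file instead of whatever default location it resolves to. Something like (the paths are my own choice):

```
input {
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    bucket => "logrepo"
    prefix => "403-forbidden/"
    region => "ap-southeast-1"
    sincedb_path => "/var/lib/logstash/sincedb_s3_403"      # per-input state file
    tags => ["s3", "403"]
  }
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    bucket => "logrepo"
    prefix => "access/"
    region => "ap-southeast-1"
    sincedb_path => "/var/lib/logstash/sincedb_s3_access"   # per-input state file
    tags => ["s3", "access"]
  }
}
```

If the duplicates persist even with separate sincedb files per input, that would point at the plugin's own timestamp-update logic rather than at shared state between the two inputs.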