S3 Input Plugin Sincedb Time

I am using the Logstash S3 Input Plugin to read gz files from an S3 bucket and ingest them into Elasticsearch.

Here is the Conf file under /etc/logstash/conf.d:

input {
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    #role_arn => "arn:aws:iam::333333333:role/dddd-elk-app"
    bucket => "logrepo"
    prefix => "403-forbidden/"
    region => "ap-southeast-1"
    tags => ["s3","403"]
  }
  s3 {
    aws_credentials_file => "/etc/logstash/aws_credential.yml"
    #role_arn => "arn:aws:iam::333333333:role/dddd-elk-app"
    bucket => "logrepo"
    prefix => "access/"
    region => "ap-southeast-1"
    tags => ["s3","access"]
  }
}
filter {
  mutate {
    add_field => {
      "source_file" => "%{[@metadata][s3][key]}"
    }
  }
  if "403" in [tags] {
    grok {
      match => [
        "message",
        "`%{TIMESTAMP_ISO8601:logTimestamp} %{DATA:ip} %{DATA:session_id} %{GREEDYDATA:error_message}"
      ]
    }
    date {
      match => [ "logTimestamp", "ISO8601" ]
      target => "@timestamp"
      locale => "en"
      timezone => "UTC"
    }
  }
}
output {
  if "403" in [tags] {
    elasticsearch {
      hosts => ["11.11.11.11:9200"]
      index => "403-%{+YYYY.MM.dd}"
    }
  }
  else if "access" in [tags] {
    elasticsearch {
      hosts => ["11.11.11.11:9200"]
      pipeline => "filebeat-7.0.0-apache-access-default"
      index => "access-%{+YYYY.MM.dd}"
    }
  }
}

The source files in the S3 bucket are gz files, generated roughly every 20-60 seconds, each around 5 MB.

I found that messages in Elasticsearch are duplicated (a single message is ingested into ES 2-5 times, each copy with a different _id).

This problem does not occur when I put the gz files into the S3 bucket one by one.
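While I investigate the root cause, I am considering making the writes idempotent with the fingerprint filter, so a re-read file overwrites the same documents instead of creating new ones. This is only a sketch: the fingerprint filter and the document_id output option are standard Logstash features, but the field name and key below are my own choices.

```
filter {
  fingerprint {
    source => "message"
    target => "[@metadata][fingerprint]"
    method => "SHA1"
    key => "dedup-key"   # the SHA1 method is an HMAC, so a key is required
  }
}
output {
  elasticsearch {
    hosts => ["11.11.11.11:9200"]
    index => "403-%{+YYYY.MM.dd}"
    # duplicates get the same _id and overwrite, instead of piling up
    document_id => "%{[@metadata][fingerprint]}"
  }
}
```

This does not fix the sincedb behaviour itself, but it should at least stop the duplicate documents from accumulating.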

When I tailed the sincedb, I found that the times written into it are not chronological. Here is an extract from the tail command:

2019-04-24 18:13:21 +0000
2019-04-24 18:13:22 +0000
2019-04-24 18:12:49 +0000
2019-04-24 18:12:50 +0000
2019-04-24 18:13:23 +0000
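To make such regressions easy to spot in a longer sincedb tail, a quick check can flag any entry that is earlier than the one written just before it. This is plain Python; `find_regressions` is a hypothetical helper, and I am assuming every line looks exactly like the extract above.

```python
from datetime import datetime

def find_regressions(lines):
    """Return (index, timestamp) pairs where a sincedb time is
    earlier than the timestamp written immediately before it."""
    fmt = "%Y-%m-%d %H:%M:%S %z"
    times = [datetime.strptime(line.strip(), fmt)
             for line in lines if line.strip()]
    return [(i, cur)
            for i, (prev, cur) in enumerate(zip(times, times[1:]), start=1)
            if cur < prev]

# The extract from above: the third entry jumps back in time.
sample = [
    "2019-04-24 18:13:21 +0000",
    "2019-04-24 18:13:22 +0000",
    "2019-04-24 18:12:49 +0000",
    "2019-04-24 18:12:50 +0000",
    "2019-04-24 18:13:23 +0000",
]
regressions = find_regressions(sample)
print(regressions)  # one regression, at index 2
```

Run against the real sincedb tail, this shows exactly how often and how far the timestamps go backwards.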

I checked and there are subfolders inside the 403-forbidden/ prefix, so I suspect:

  1. The differing timestamps are written into the sincedb because files are being read from different sub-directories.
  2. Since a "past time" is written into the sincedb, files whose timestamps are newer than that time are read again, which would explain the duplicates.

May I ask for advice on how to troubleshoot this issue?

I did an experiment:

When I edited the sincedb to a past date/time and saved it, Logstash re-read gz files that had already been ingested, because the date/time was moved back.

So, I think the root cause is that the sincedb is being wrongly updated (possibly by Logstash itself).
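As a next step, I plan to give each s3 input its own sincedb_path, so the two inputs cannot overwrite each other's state. sincedb_path is a documented option of the s3 input plugin; whether my two inputs currently share a sincedb file by default is my assumption, and the paths below are only examples.

```
input {
  s3 {
    bucket => "logrepo"
    prefix => "403-forbidden/"
    region => "ap-southeast-1"
    sincedb_path => "/var/lib/logstash/sincedb_403"     # example path
    tags => ["s3","403"]
  }
  s3 {
    bucket => "logrepo"
    prefix => "access/"
    region => "ap-southeast-1"
    sincedb_path => "/var/lib/logstash/sincedb_access"  # example path
    tags => ["s3","access"]
  }
}
```

If the interleaved writes from the two inputs are what produce the non-chronological timestamps, separating the files should make the sincedb times monotonic again.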
