Logstash runs out of memory and CPU usage is too high with the S3 input plugin

Logstash runs out of memory when the S3 input plugin is used on a bucket that already contains tons of files.




How much memory did you configure for the Logstash heap in the jvm.options file? And what does your pipeline configuration look like?

The s3 input has a lot of issues when dealing with buckets that contain a lot of files, but I would have expected it to be slow, not to throw an OOM.

I am using a heap size of 5 GB, in an ECS container with 1.7 vCPU and 8 GB of memory.

Pipeline Config:

input {
  s3 {
    region => "${REGION}"
    bucket => "${S3_BUCKET}"
    interval => "120"
    type => "ecs"
    # delete => true
    codec => "json"
    prefix => "2023/"
    gzip_pattern => ".*?$"
    sincedb_path => "/usr/share/logstash/data/plugins/inputs/s3/since_db_file"
  }
}
filter{
    mutate {
      gsub => ["[event][original]", "}{", "},{"]
      gsub => ["[event][original]", "^{", "[{"]
      gsub => ["[event][original]", "}$", "}]"]
    }
    json {
      source => "[event][original]"
      target => "json_message"
    }
    split {
      field => "json_message"
    }
    split{
      field => "[json_message][logEvents]"
    }
    mutate{
      add_field => {
        "log-group" => "%{[json_message][logGroup]}"
        "log-stream" => "%{[json_message][logStream]}"
        "raw-application-log" => "%{[json_message][logEvents][message]}"
      }
    }
    if [raw-application-log] =~ "^{.*}$" {
      json{
        source => "raw-application-log"
        target => "json-application-log"
      }
      if [json-application-log][child_task_id]{
          mutate{
            add_field => {
              "child-task-id" => "%{[json-application-log][child_task_id]}"
            }
          }
      }
      if [json-application-log][child_task_name]{
          mutate{
            add_field => {
              "child-task-name" => "%{[json-application-log][child_task_name]}"
            }
          }
      }
#      if [json-application-log][event]{
#          mutate{
#            add_field => {
#              "event" => "%{[json-application-log][event]}"
#            }
#          }
#      }
      if [json-application-log][level]{
          mutate{
            add_field => {
              "log-level" => "%{[json-application-log][level]}"
            }
          }
      }
      if [json-application-log][logger]{
          mutate{
            add_field => {
              "logger" => "%{[json-application-log][logger]}"
            }
          }
      }
      if [json-application-log][parent_task_id]{
          mutate{
            add_field => {
              "parent-task-id" => "%{[json-application-log][parent_task_id]}"
            }
          }
      }
      if [json-application-log][request_id]{
          mutate{
            add_field => {
              "request-id" => "%{[json-application-log][request_id]}"
            }
          }
      }
      if [json-application-log][task_id]{
          mutate{
            add_field => {
              "task-id" => "%{[json-application-log][task_id]}"
            }
          }
      }
      if [json-application-log][tenant]{
          mutate{
            add_field => {
              "tenant" => "%{[json-application-log][tenant]}"
            }
          }
      }
      if [json-application-log][tenant_db_alias]{
          mutate{
            add_field => {
              "tenant-db-alias" => "%{[json-application-log][tenant_db_alias]}"
            }
          }
      }
      if [json-application-log][user_id]{
          mutate{
            add_field => {
              "user-id" => "%{[json-application-log][user_id]}"
            }
          }
      }
      if [json-application-log][timestamp]{
          mutate{
            add_field => {
              "log-timestamp" => "%{[json-application-log][timestamp]}"
            }
          }
      }
    }
    mutate{
      remove_field => [
        "[event][original]",
        "[json_message]",
        "[logEvents]",
        "[messageType]",
        "[owner]",
        "[logGroup]",
        "[logStream]",
        "[subscriptionFilters]",
        "[json-application-log]",
        "[type]"
      ]
    }
    grok {
      match => [
        "log-group",
        "/ecs/%{GREEDYDATA:environment}/%{GREEDYDATA:service}"
      ]
    }
#    sleep {
#      time => "1"   # Sleep 1 second
#      every => 300   # on every 300th event
#    }
}
output {
  elasticsearch {
    hosts => ["http://${ELASTICSEARCH_HOST}:80"]
    index => "%{environment}"
  }
  stdout { codec => json }
}
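
For readers following the filter section, here is a rough Python sketch of what the three gsub rules followed by the json filter appear intended to do: turn a payload of back-to-back JSON objects into a JSON array that can then be split. The sample payload and the function name `wrap_concatenated_objects` are made up for illustration, and the sketch assumes the payload is a single line (Logstash gsub uses Ruby regular expressions, where `^` and `$` match at every line boundary).

```python
import json
import re


def wrap_concatenated_objects(raw: str) -> str:
    """Approximate the three gsub rules from the pipeline above."""
    s = raw.replace("}{", "},{")   # gsub "}{" -> "},{"
    s = re.sub(r"^\{", "[{", s)    # gsub "^{" -> "[{"
    s = re.sub(r"\}$", "}]", s)    # gsub "}$" -> "}]"
    return s


# Hypothetical payload: two objects written back to back, no separator.
raw = (
    '{"logGroup":"/ecs/dev/api","logEvents":[{"message":"a"}]}'
    '{"logGroup":"/ecs/dev/api","logEvents":[{"message":"b"},{"message":"c"}]}'
)

wrapped = wrap_concatenated_objects(raw)
batches = json.loads(wrapped)  # roughly what the json filter stores in json_message
print(len(batches))
```

Note that the first rule is a plain text substitution, so a `}{` sequence inside a string value (for example inside a log message) would also be rewritten.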

Here are the Logstash metrics. Can anyone explain why the number of events received is much lower than the number of events emitted?
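
One likely explanation, sketched in Python under stated assumptions: the split filter emits one event per element of the array it is pointed at, so a single event received from the input can become many events by the time it reaches the output. The `split` helper and the sample data below are hypothetical and only mimic that behaviour; this is not the plugin's actual code.

```python
from copy import deepcopy


def split(events, path):
    """Mimic a split filter: emit one copy of each event per list element at `path`."""
    out = []
    for event in events:
        parent = event
        for key in path[:-1]:
            parent = parent[key]
        for item in parent[path[-1]]:
            clone = deepcopy(event)      # each output event is a full copy of the input event
            target = clone
            for key in path[:-1]:
                target = target[key]
            target[path[-1]] = item      # the list is replaced by a single element
            out.append(clone)
    return out


# Hypothetical event: one S3 object holding two batches with 1 and 2 log events.
received = [{"json_message": [
    {"logGroup": "/ecs/dev/api", "logEvents": [{"message": "a"}]},
    {"logGroup": "/ecs/dev/api", "logEvents": [{"message": "b"}, {"message": "c"}]},
]}]

after_first = split(received, ["json_message"])
emitted = split(after_first, ["json_message", "logEvents"])
print(len(received), len(after_first), len(emitted))  # 1 2 3
```

With two chained splits the emitted count is the total number of log events across all batches, which would make "events emitted" much larger than "events received". Because every output event starts as a copy of the whole input event, large source objects may also add to heap pressure, although that is an assumption and not something the metrics alone confirm.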
