S3 input plugin: sincedb doesn't work properly: high CPU usage of Logstash when there are many files in the folder

Issue:

CPU usage is too high when there are thousands of files in the AWS S3 log folder (s3 input plugin in use)

ELK/Logstash setup

The AWS S3 bucket is scanned by the S3 input plugin, which sends the data to Logstash

Issue description:

It seems to be very difficult for the S3 plugin to handle thousands/millions of files in the log folder of the S3 bucket. Logstash fetches all the files and parses them with grok; this operation seems to be pretty OK CPU-wise. Unfortunately, as the number of log files (records in the folder) grows over time, the CPU usage of Logstash grows too.

With 30 days of logs in the bucket everything is quite fine and the CPU usage of the Logstash process is around 70%; with 90 days of log files present in the logs folder, the CPU usage of the Logstash process rises up to 300%.

We only keep 14 days of data in our Elasticsearch instance.

Configuration of logstash.conf (s3 input plugin included):

input
{
    s3
    {
        bucket => "<bucket_name>"
        prefix => "production/lb/<path>/elasticloadbalancing/eu-west-1/"
        region => "eu-west-1"
        type => "alblogs"
        codec => plain
        sincedb_path => "/opt/<path>/elasticsearch/plugins/repository-s3/alblogs.txt"
        secret_access_key => "<secret>"
        access_key_id => "<access_key_id>"
    }

...

 filter {
   if [type] == "alblogs" {
      grok {
         match => ["message", "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:loadbalancer} %{IP:client_ip}:%{NUMBER:client_port:int} (?:%{IP:backend_ip}:%{NUMBER:backend_port:int}|-) %{NUMBER:request_processing_time:float} %{NUMBER:backend_processing_time:float} %{NUMBER:response_processing_time:float} (?:%{NUMBER:elb_status_code:int}|-) (?:%{NUMBER:backend_status_code:int}|-) %{NUMBER:received_bytes:int} %{NUMBER:sent_bytes:int} \"(?:%{WORD:verb}|-) (?:%{GREEDYDATA:request}|-) (?:HTTP/%{NUMBER:httpversion}|-( )?)\" \"%{DATA:userAgent}\"( %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol})?"]
        match => [ "request", "%{UUID:event_uuid}" ]
      }

...

  if [type] == "s3_production" {
    grok {
        match => ["message", "%{NOTSPACE:s3_owner}[ \t](-|%{HOSTNAME:s3_bucket})[ \t]\[%{HTTPDATE:timestamp}\][ \t]%{IP:s3_remote_ip}[ \t]%{NOTSPACE:Requester}[ \t]%{NOTSPACE:RequesterID}[ \t]%{NOTSPACE:s3_operation}[ \t]%{NOTSPACE:s3_key}[ \t]%{NOTSPACE:request_method}[ \t]%{NOTSPACE:request_url}[ \t]%{NOTSPACE:request_protocol}[ \t]%{NUMBER:HTTP_status}[ \t]%{NOTSPACE:s3_errorCode}[ \t]%{NOTSPACE:s3_bytesSent}[ \t]%{NOTSPACE:s3_objectSize}[ \t]%{NUMBER:s3_totalTime}[ \t]%{NOTSPACE:s3_turnaroundTime}[ \t]\"%{NOTSPACE:Referrer}\"[ \t]\"%{GREEDYDATA:UserAgent}\"[ \t]%{NOTSPACE:s3_versionId}[ \t]%{NOTSPACE:s3_hostId}[ \t]%{NOTSPACE:s3_signarureVersion}[ \t]%{NOTSPACE:s3_cipherSuite}[ \t]%{NOTSPACE:s3_authType}[ \t]%{HOSTNAME:s3_hostHeader}[ \t]%{NOTSPACE:s3_TLSversion}"]

    add_tag => [ "production" ]
    tag_on_failure => [ "S3.EXPIRE.OBJECT" ]
    }
    mutate {
        remove_field => [ "message" ]
    }
  }

...

output
{

   if [type] == "alblogs" {
    elasticsearch {
        hosts => ["127.0.0.1:9200"]
        index => "alblogs-%{+YYYY.MM.dd}"
       }
   if [type] == "s3_production" {
    elasticsearch {
        hosts => ["127.0.0.1:9200"]
        index => "s3-production-%{+YYYY.MM.dd}"
       }
    }

As you can see in the configuration above, we use the sincedb_path parameter, and I can see that the timestamp and the content of the sincedb file change over time, but unfortunately that does not seem to apply to the "reading/scanning" of the whole bucket (folder).
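For context, we have not set the plugin's interval option at all, so (assuming I am reading the documentation right and its default of 60 seconds applies) the plugin waits about a minute after each run and then lists the whole prefix again:

    s3
    {
        ...
        # not present in our config; 60 seconds is the documented default,
        # so after each run the plugin pauses briefly and then re-lists the entire prefix
        interval => 60
    }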

At one point I also tried the watch_for_new_files parameter in the config file, but then the plugin does one listing, fetches and parses all the files, and then stops looking for new files.
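For reference, this is roughly the input block from that test (a minimal sketch, keys omitted; the explicit watch_for_new_files => false value is my assumption about the setting that produces the behaviour described above):

    s3
    {
        bucket => "<bucket_name>"
        prefix => "production/lb/<path>/elasticloadbalancing/eu-west-1/"
        region => "eu-west-1"
        type => "alblogs"
        codec => plain
        sincedb_path => "/opt/<path>/elasticsearch/plugins/repository-s3/alblogs.txt"
        # assumption: with this set to false the plugin lists the prefix once,
        # processes everything it found, and never polls again
        watch_for_new_files => false
    }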

Does anybody have an idea?

As far as I can see, this could be the whole story, and it is probably not solved yet [https://github.com/logstash-plugins/logstash-input-s3/issues/128].

Any help or suggestion would be appreciated! Thanks!

