Our infrastructure is built on AWS, and in some special cases the logs are stored only in S3 buckets, so if we want to analyse these logs with ELK we need to use the 'S3 input plugin' [1].
Unfortunately, the CPU usage of the 'logstash' process is very high, and so is the incoming network load (Network In), even though we have 'sincedb_path' in use.
In the picture above you can see that after I added the S3 access logs (since 5/7/2019 they are also sent to / analysed with ELK), the 'Network In' load keeps growing daily. It looks to me like the S3 input plugin is scanning the whole bucket of S3 access logs (each S3 access log entry appears as one line in a separate file, so there is a huge number of files in that bucket).
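The inputs for the S3 access logs are configured the same way as the ALB ones; a rough sketch with placeholder bucket/prefix/sincedb values (not our real paths) looks like this:
s3
{
# placeholder values - the real bucket, prefix and sincedb path differ
bucket => "<s3_access_log_bucket>"
prefix => "production/s3/<path>/"
region => "eu-west-1"
type => "s3_production"
codec => plain
sincedb_path => "/opt/<path>/s3_production.txt"
secret_access_key => "<secret>"
access_key_id => "<access_key_id>"
}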
Currently we are using 16 S3 input plugins in logstash.conf (in the input section) - please see an example of the configuration below*.
input
{
s3
{
bucket => "<bucket_name>"
prefix => "production/lb/<path>/elasticloadbalancing/eu-west-1/"
region => "eu-west-1"
type => "alblogs"
codec => plain
sincedb_path => "/opt/<path>/elasticsearch/plugins/repository-s3/alblogs.txt"
secret_access_key => "<secret>"
access_key_id => "<access_key_id>"
}
...
}
filter {
if [type] == "alblogs" {
grok {
match => ["message", "%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:loadbalancer} %{IP:client_ip}:%{NUMBER:client_port:int} (?:%{IP:backend_ip}:%{NUMBER:backend_port:int}|-) %{NUMBER:request_processing_time:float} %{NUMBER:backend_processing_time:float} %{NUMBER:response_processing_time:float} (?:%{NUMBER:elb_status_code:int}|-) (?:%{NUMBER:backend_status_code:int}|-) %{NUMBER:received_bytes:int} %{NUMBER:sent_bytes:int} \"(?:%{WORD:verb}|-) (?:%{GREEDYDATA:request}|-) (?:HTTP/%{NUMBER:httpversion}|-( )?)\" \"%{DATA:userAgent}\"( %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol})?"]
match => [ "request", "%{UUID:event_uuid}" ]
}
}
...
if [type] == "s3_production" {
grok {
match => ["message", "%{NOTSPACE:s3_owner}[ \t](-|%{HOSTNAME:s3_bucket})[ \t]\[%{HTTPDATE:timestamp}\][ \t]%{IP:s3_remote_ip}[ \t]%{NOTSPACE:Requester}[ \t]%{NOTSPACE:RequesterID}[ \t]%{NOTSPACE:s3_operation}[ \t]%{NOTSPACE:s3_key}[ \t]%{NOTSPACE:request_method}[ \t]%{NOTSPACE:request_url}[ \t]%{NOTSPACE:request_protocol}[ \t]%{NUMBER:HTTP_status}[ \t]%{NOTSPACE:s3_errorCode}[ \t]%{NOTSPACE:s3_bytesSent}[ \t]%{NOTSPACE:s3_objectSize}[ \t]%{NUMBER:s3_totalTime}[ \t]%{NOTSPACE:s3_turnaroundTime}[ \t]\"%{NOTSPACE:Referrer}\"[ \t]\"%{GREEDYDATA:UserAgent}\"[ \t]%{NOTSPACE:s3_versionId}[ \t]%{NOTSPACE:s3_hostId}[ \t]%{NOTSPACE:s3_signarureVersion}[ \t]%{NOTSPACE:s3_cipherSuite}[ \t]%{NOTSPACE:s3_authType}[ \t]%{HOSTNAME:s3_hostHeader}[ \t]%{NOTSPACE:s3_TLSversion}"]
add_tag => [ "production" ]
tag_on_failure => [ "S3.EXPIRE.OBJECT" ]
}
mutate {
remove_field => [ "message" ]
}
}
...
}
output
{
if [type] == "alblogs" {
elasticsearch {
hosts => ["127.0.0.1:9200"]
index => "alblogs-%{+YYYY.MM.dd}"
}
}
if [type] == "s3_production" {
elasticsearch {
hosts => ["127.0.0.1:9200"]
index => "s3-production-%{+YYYY.MM.dd}"
}
}
}
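For completeness: as far as I can tell from the plugin documentation, the only per-input options that influence how often and how widely the bucket is listed are 'interval', 'watch_for_new_files' and 'exclude_pattern'. A sketch of how they would sit on one of the inputs (the values here are just examples; we currently run with the defaults):
s3
{
bucket => "<bucket_name>"
prefix => "production/lb/<path>/elasticloadbalancing/eu-west-1/"
region => "eu-west-1"
type => "alblogs"
codec => plain
sincedb_path => "/opt/<path>/elasticsearch/plugins/repository-s3/alblogs.txt"
interval => 300              # example: list the prefix every 5 minutes instead of the default 60 seconds
watch_for_new_files => true  # default; false would stop the input after a single listing
secret_access_key => "<secret>"
access_key_id => "<access_key_id>"
}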
Any advice, please? Is this normal? Could somebody please help?
*The parsing (grok) doesn't seem to be the issue; I have optimized it as much as I could.
[1] https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html