Logstash high load and CPU usage

Hi,

I am running an Elastic Stack with multiple Logstash servers in different networks to aggregate, filter and forward the logs. For some time now, some of these Logstash nodes regularly show very high load and CPU usage. When I restart the Logstash service, everything is fine again for a while. You can see this behavior in this screenshot from Elastic Monitoring.

I have already searched quite a bit for this problem, but still have no clue what exactly is causing the increased load. I would be very happy if anyone has an idea what the cause might be and could point me in a direction!

Some more information on my configuration:

The specs of the nodes:

4 vCPUs
8GB RAM
6GB Heap
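
For reference, a 6 GB heap corresponds to the following lines in Logstash's jvm.options (/etc/logstash/jvm.options on a package install):

# Initial and maximum JVM heap size
-Xms6g
-Xmx6g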

Logstash config:

# Ansible managed

pipeline.ordered: auto
path:
  data: /var/lib/logstash
  logs: /var/log/logstash

xpack.monitoring.enabled: false
monitoring.enabled: false
monitoring.cluster_uuid: "uuid"

xpack.management:
  enabled: true
  elasticsearch:
    hosts: ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
    username: "logstash_internal"
    password: "password"
    ssl:
      verification_mode: certificate
      certificate_authority: /etc/logstash/certs/elastic-stack-ca.pem
  logstash.poll_interval: "5s"
  pipeline.id: ["lan"]

Pipeline:

input {
  elastic_agent {
    host            => "${IP_ADDRESS}"
    port            => 5044
    ssl_enabled     => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt.pem"
    ssl_key         => "/etc/logstash/certs/logstash.key.pem"
    ssl_client_authentication => "none"
    type            => "elastic_agent"
  }
  gelf {
    host    => "${IP_ADDRESS}"
    use_udp => false
    use_tcp => true
    port    => 12201
    type    => "gelf"
  }
  syslog {
    host              => "${IP_ADDRESS}"
    port              => 10514
    type              => "syslog"
    proxy_protocol    => true
    ecs_compatibility => "v8"
  }
}

filter {
  if [host][hostname] in ["server1", "server2", "server3", "server4"] {
    mutate {
      add_field => {
        "[data_stream][type]"      => "logs"
        "[data_stream][dataset]"   => "webservices"
        "[data_stream][namespace]" => "dev"
      }
    }
  }
  if [host][hostname] in ["server5", "server6", "server7", "server8", "server9"] {
    mutate {
      add_field => {
        "[data_stream][type]"      => "logs"
        "[data_stream][dataset]"   => "webservices"
        "[data_stream][namespace]" => "prod"
      }
    }
  }
}

output {
  if ([data_stream][type] and [data_stream][type] != "" ) and ([data_stream][dataset] and [data_stream][dataset] != "" ) {
    elasticsearch {
      hosts                        => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      data_stream                  => "true"
      user                         => "logstash_internal"
      password                     => "password"
      ssl_enabled                  => "true"
      ssl_verification_mode        => "full"
      ssl_certificate_authorities  => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  } else if [type] == "syslog" {
    elasticsearch {
      hosts                       => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      ilm_enabled                 => true
      ilm_rollover_alias          => "syslog"
      ilm_pattern                 => "{now/d}-000001"
      ilm_policy                  => "syslog"
      user                        => "logstash_internal"
      password                    => "password"
      ssl_enabled                 => "true"
      ssl_verification_mode       => "full"
      ssl_certificate_authorities => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  }
  if [type] == "gelf" {
    elasticsearch {
      hosts                        => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      ilm_enabled                  => true
      ilm_rollover_alias           => "gelf"
      ilm_pattern                  => "{now/d}-000001"
      ilm_policy                   => "gelf"
      user                         => "logstash_internal"
      password                     => "password"
      ssl_enabled                  => "true"
      ssl_verification_mode        => "full"
      ssl_certificate_authorities  => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  }
}

Are you using persistent queues or memory queues? It is not clear.

Is logstash the only service running on this server?

Do you have anything in the logs?

Thanks for the quick reply. I'm glad to provide more information:

My Logstash nodes are using the default memory queues.
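
(For completeness: switching to persistent queues would only need something like the following in logstash.yml; the values here are purely illustrative, I have not tried this.)

# Spool events to disk instead of holding them in memory
queue.type: persisted
# Upper limit for the on-disk queue per pipeline
queue.max_bytes: 2gb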

Apart from Logstash, the only noteworthy services running on the servers are the Elastic Agent and HAProxy/Keepalived, which load-balance the syslog input between two Logstash nodes. According to htop, these processes hardly use any CPU or memory.
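
The HAProxy side is just a plain TCP proxy in front of the syslog port, roughly along these lines (addresses and server names are placeholders; send-proxy is what matches the proxy_protocol => true on the syslog input):

frontend syslog_in
    mode tcp
    bind 192.0.2.10:10514  # keepalived VIP (placeholder)
    default_backend logstash_syslog

backend logstash_syslog
    mode tcp
    balance roundrobin
    # send-proxy adds the PROXY protocol header the Logstash syslog input expects
    server logstash1 192.0.2.11:10514 check send-proxy
    server logstash2 192.0.2.12:10514 check send-proxy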

The only messages I get in logstash-plain.log look like these:

[2024-04-09T17:04:27,340][INFO ][logstash.inputs.syslog   ][lan][1fa9b49c9429d1d0a7fa6399b888deb8ae2ac1a205deaf8ccf38ff44b5e2ed5b] new connection {:client=>"10.20.30.40:52270"}
[2024-04-09T17:04:26,503][WARN ][logstash.outputs.elasticsearch][lan][8da42123ea1df5ae3882151c817efd71ae382fac24ed2de9d61bc5c93419a5f5] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-apache_tomcat.cache-default", :routing=>nil}, {"service"=>{"type"=>"prometheus", "address"=>"http://localhost:9090/metrics"}, "elastic_agent"=>{"id"=>"522874b7-bd30-487c-8c9f-a1fd3564e589", "version"=>"8.8.1", "snapshot"=>false}, "@version"=>"1", "type"=>"elastic_agent", "tags"=>["apache_tomcat-cache", "beats_input_raw_event"], "ecs"=>{"version"=>"8.0.0"}, "agent"=>{"name"=>"hostname", "type"=>"metricbeat", "id"=>"522874b7-bd30-487c-8c9f-a1fd3564e589", "version"=>"8.8.1", "ephemeral_id"=>"d55430eb-c910-417c-bf80-971cb2b62c25"}, "prometheus"=>{"labels"=>{"name"=>"Cache", "job"=>"prometheus", "context"=>"/manager", "host"=>"localhost", "instance"=>"localhost:9090"}, "metrics"=>{"Catalina_WebResourceRoot_maxSize"=>10240, "Catalina_WebResourceRoot_ttl"=>5000, "Catalina_WebResourceRoot_size"=>12, "Catalina_WebResourceRoot_objectMaxSize"=>512, "Catalina_WebResourceRoot_lookupCount"=>13, "Catalina_WebResourceRoot_hitCount"=>4}}, "metricset"=>{"name"=>"collector", "period"=>10000}, "data_stream"=>{"type"=>"metrics", "dataset"=>"apache_tomcat.cache", "namespace"=>"default"}, "event"=>{"dataset"=>"apache_tomcat.cache", "duration"=>151052696, "module"=>"prometheus"}, "host"=>{"name"=>"hostname", "id"=>"90ad598b369d41f68860a2898fb81488", "mac"=>["00-00-00-00-00-00"], "architecture"=>"x86_64", "hostname"=>"hostname", "os"=>{"platform"=>"ol", "name"=>"Oracle Linux Server", "kernel"=>"5.15.0-102.110.5.1.el9uek.x86_64", "type"=>"linux", "version"=>"9.2", "family"=>"redhat"}, "ip"=>["10.20.30.40"], "containerized"=>false}, "@timestamp"=>2024-04-09T15:04:25.221Z}], :response=>{"create"=>{"_index"=>".ds-metrics-apache_tomcat.cache-default-2024.03.23-000003", "_id"=>"8aSlLs8Em-fXACmJAAABjsNjjYU", "status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[8aSlLs8Em-fXACmJAAABjsNjjYU][{agent.id=522874b7-bd30-487c-8c9f-a1fd3564e589, apache_tomcat.cache.application_name=/manager, host.name=hostname, service.address=http://localhost:9090/metrics}@2024-04-09T15:04:25.221Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"6nhJvzKxT8axN1Z_4EzFew", "shard"=>"0", "index"=>".ds-metrics-apache_tomcat.cache-default-2024.03.23-000003"}}}}

But these don't appear to be particularly related to my problem.

The screenshot suggests the JVM heap is continuously growing, and the CPU and system load are growing along with it. It looks like the heap only shrinks significantly when the JVM is restarted (leading to the brief gaps in the monitoring data).

That suggests a GC issue. I would enable GC logging (how to do that depends on the JVM, its version and the options you are using). That will show you the time spent on GC. Then get a heap dump and take a look at what is using up the heap. See this thread.
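
For the JDK that ships with Logstash 8.x (Java 11 or newer), GC logging can be switched on with a single -Xlog line in jvm.options; the log path and rotation settings below are only an example:

# Write GC activity to a rotating log file
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/logstash/gc.log:utctime,pid,tags:filecount=32,filesize=64m

A heap dump of the running process can then be taken with jmap, e.g. jmap -dump:format=b,file=/tmp/logstash.hprof <logstash pid>.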


What I have discovered is that if you have a stdout output in your output section and you are processing a lot of documents, the load will go super high.

Thank you for the suggestion! I am not entirely convinced that it has anything to do with the heap space, though. The heap and the load in the screenshot both drop at the same time because the Logstash service was restarted. At other times I also saw the garbage collector running and freeing up heap space while the load still remained high.
I will have a look into your suggestion anyway and see if I can find anything out!

I typically run logstash with 200 MB of heap. If the heap is growing to 5 GB then I cannot think of any explanation other than a memory leak.

As I already assumed, the problem wasn't the heap itself. It was the syslog input, as described here:

An update of the syslog input plugin to version 3.7.0 solved the problem.
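
The update can be done with Logstash's plugin tool (path shown for a package install); Logstash needs a restart afterwards so the new plugin version is loaded:

# Check the currently installed version of the syslog input
/usr/share/logstash/bin/logstash-plugin list --verbose | grep syslog

# Update the plugin to the latest published version
/usr/share/logstash/bin/logstash-plugin update logstash-input-syslog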
