Logstash high load and CPU usage

Hi,

I am running an Elastic Stack with multiple Logstash servers in different networks to aggregate, filter and forward the logs. For some time now, some of these Logstash nodes regularly show very high load and CPU usage. When I restart the Logstash service, everything is fine again for a while. You can see this behavior in this screenshot from Elastic Monitoring.

I have already searched quite a bit for this problem, but still have no clue what exactly is causing the increased load. I would be very happy if anyone has an idea what the cause might be and could point me in a direction!

Some more information on my configuration:

The specs of the nodes:

4 vCPUs
8GB RAM
6GB Heap
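
For reference, a 6 GB heap corresponds to the following lines in Logstash's jvm.options (/etc/logstash/jvm.options on a package install):

# Initial and maximum JVM heap size
-Xms6g
-Xmx6g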

Logstash config:

# Ansible managed

pipeline.ordered: auto
path:
  data: /var/lib/logstash
  logs: /var/log/logstash

xpack.monitoring.enabled: false
monitoring.enabled: false
monitoring.cluster_uuid: "uuid"

xpack.management:
  enabled: true
  elasticsearch:
    hosts: ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
    username: "logstash_internal"
    password: "password"
    ssl:
      verification_mode: certificate
      certificate_authority: /etc/logstash/certs/elastic-stack-ca.pem
  logstash.poll_interval: "5s"
  pipeline.id: ["lan"]

Pipeline:

input {
  elastic_agent {
    host            => "${IP_ADDRESS}"
    port            => 5044
    ssl_enabled     => true
    ssl_certificate => "/etc/logstash/certs/logstash.crt.pem"
    ssl_key         => "/etc/logstash/certs/logstash.key.pem"
    ssl_client_authentication => "none"
    type            => "elastic_agent"
  }
  gelf {
    host    => "${IP_ADDRESS}"
    use_udp => false
    use_tcp => true
    port    => 12201
    type    => "gelf"
  }
  syslog {
    host              => "${IP_ADDRESS}"
    port              => 10514
    type              => "syslog"
    proxy_protocol    => true
    ecs_compatibility => "v8"
  }
}

filter {
  if [host][hostname] in ["server1", "server2", "server3", "server4"] {
    mutate {
      add_field => {
        "[data_stream][type]"      => "logs"
        "[data_stream][dataset]"   => "webservices"
        "[data_stream][namespace]" => "dev"
      }
    }
  }
  if [host][hostname] in ["server5", "server6", "server7", "server8", "server9"] {
    mutate {
      add_field => {
        "[data_stream][type]"      => "logs"
        "[data_stream][dataset]"   => "webservices"
        "[data_stream][namespace]" => "prod"
      }
    }
  }
}

output {
  if ([data_stream][type] and [data_stream][type] != "" ) and ([data_stream][dataset] and [data_stream][dataset] != "" ) {
    elasticsearch {
      hosts                        => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      data_stream                  => "true"
      user                         => "logstash_internal"
      password                     => "password"
      ssl_enabled                  => "true"
      ssl_verification_mode        => "full"
      ssl_certificate_authorities  => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  } else if [type] == "syslog" {
    elasticsearch {
      hosts                       => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      ilm_enabled                 => true
      ilm_rollover_alias          => "syslog"
      ilm_pattern                 => "{now/d}-000001"
      ilm_policy                  => "syslog"
      user                        => "logstash_internal"
      password                    => "password"
      ssl_enabled                 => "true"
      ssl_verification_mode       => "full"
      ssl_certificate_authorities => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  }
  if [type] == "gelf" {
    elasticsearch {
      hosts                        => ["https://elastic1:9200", "https://elastic2:9200", "https://elastic3:9200"]
      ilm_enabled                  => true
      ilm_rollover_alias           => "gelf"
      ilm_pattern                  => "{now/d}-000001"
      ilm_policy                   => "gelf"
      user                         => "logstash_internal"
      password                     => "password"
      ssl_enabled                  => "true"
      ssl_verification_mode        => "full"
      ssl_certificate_authorities  => "/etc/logstash/certs/elastic-stack-ca.pem"
    }
  }
}

Are you using persistent queues or memory queues? It is not clear.

Is logstash the only service running on this server?

Do you have anything in the logs?

Thanks for the quick reply. I'm glad to provide more information:

My Logstash nodes are using the default memory queues.
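
(For completeness: switching to persistent queues would only need something like the following in logstash.yml; the values here are purely illustrative, I have not tried this.)

# Spool events to disk instead of holding them in memory
queue.type: persisted
# Upper limit for the on-disk queue per pipeline
queue.max_bytes: 2gb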

Apart from Logstash, the only noteworthy services running on the servers are the Elastic Agent and HAProxy/Keepalived, which load-balance the syslog input between two Logstash nodes. According to htop, these processes hardly use any CPU or memory.
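
The HAProxy side is just a plain TCP proxy in front of the syslog port, roughly along these lines (addresses and server names are placeholders; send-proxy is what matches the proxy_protocol => true on the syslog input):

frontend syslog_in
    mode tcp
    bind 192.0.2.10:10514  # keepalived VIP (placeholder)
    default_backend logstash_syslog

backend logstash_syslog
    mode tcp
    balance roundrobin
    # send-proxy adds the PROXY protocol header the Logstash syslog input expects
    server logstash1 192.0.2.11:10514 check send-proxy
    server logstash2 192.0.2.12:10514 check send-proxy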

The only messages I get in logstash-plain.log look like these:

[2024-04-09T17:04:27,340][INFO ][logstash.inputs.syslog   ][lan][1fa9b49c9429d1d0a7fa6399b888deb8ae2ac1a205deaf8ccf38ff44b5e2ed5b] new connection {:client=>"10.20.30.40:52270"}
[2024-04-09T17:04:26,503][WARN ][logstash.outputs.elasticsearch][lan][8da42123ea1df5ae3882151c817efd71ae382fac24ed2de9d61bc5c93419a5f5] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-apache_tomcat.cache-default", :routing=>nil}, {"service"=>{"type"=>"prometheus", "address"=>"http://localhost:9090/metrics"}, "elastic_agent"=>{"id"=>"522874b7-bd30-487c-8c9f-a1fd3564e589", "version"=>"8.8.1", "snapshot"=>false}, "@version"=>"1", "type"=>"elastic_agent", "tags"=>["apache_tomcat-cache", "beats_input_raw_event"], "ecs"=>{"version"=>"8.0.0"}, "agent"=>{"name"=>"hostname", "type"=>"metricbeat", "id"=>"522874b7-bd30-487c-8c9f-a1fd3564e589", "version"=>"8.8.1", "ephemeral_id"=>"d55430eb-c910-417c-bf80-971cb2b62c25"}, "prometheus"=>{"labels"=>{"name"=>"Cache", "job"=>"prometheus", "context"=>"/manager", "host"=>"localhost", "instance"=>"localhost:9090"}, "metrics"=>{"Catalina_WebResourceRoot_maxSize"=>10240, "Catalina_WebResourceRoot_ttl"=>5000, "Catalina_WebResourceRoot_size"=>12, "Catalina_WebResourceRoot_objectMaxSize"=>512, "Catalina_WebResourceRoot_lookupCount"=>13, "Catalina_WebResourceRoot_hitCount"=>4}}, "metricset"=>{"name"=>"collector", "period"=>10000}, "data_stream"=>{"type"=>"metrics", "dataset"=>"apache_tomcat.cache", "namespace"=>"default"}, "event"=>{"dataset"=>"apache_tomcat.cache", "duration"=>151052696, "module"=>"prometheus"}, "host"=>{"name"=>"hostname", "id"=>"90ad598b369d41f68860a2898fb81488", "mac"=>["00-00-00-00-00-00"], "architecture"=>"x86_64", "hostname"=>"hostname", "os"=>{"platform"=>"ol", "name"=>"Oracle Linux Server", "kernel"=>"5.15.0-102.110.5.1.el9uek.x86_64", "type"=>"linux", "version"=>"9.2", "family"=>"redhat"}, "ip"=>["10.20.30.40"], "containerized"=>false}, "@timestamp"=>2024-04-09T15:04:25.221Z}], :response=>{"create"=>{"_index"=>".ds-metrics-apache_tomcat.cache-default-2024.03.23-000003", "_id"=>"8aSlLs8Em-fXACmJAAABjsNjjYU", "status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[8aSlLs8Em-fXACmJAAABjsNjjYU][{agent.id=522874b7-bd30-487c-8c9f-a1fd3564e589, apache_tomcat.cache.application_name=/manager, host.name=hostname, service.address=http://localhost:9090/metrics}@2024-04-09T15:04:25.221Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"6nhJvzKxT8axN1Z_4EzFew", "shard"=>"0", "index"=>".ds-metrics-apache_tomcat.cache-default-2024.03.23-000003"}}}}

But these don't appear to be particularly related to my problem.

The screenshot suggests the JVM heap is continuously growing, and the CPU and system load are growing along with it. It looks like the heap only shrinks significantly when the JVM is restarted (leading to the brief gaps in the monitoring data).

That suggests a GC issue. I would enable GC logging (how to do that depends on the JVM, its version and the options you are using). That will show you the time spent on GC. Then get a heap dump and take a look at what is using up the heap. See this thread.
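
For the JDK that ships with Logstash 8.x (Java 11 or newer), GC logging can be switched on with a single -Xlog line in jvm.options; the log path and rotation settings below are only an example:

# Write GC activity to a rotating log file
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/logstash/gc.log:utctime,pid,tags:filecount=32,filesize=64m

A heap dump of the running process can then be taken with jmap, e.g. jmap -dump:format=b,file=/tmp/logstash.hprof <logstash pid>.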


What I have discovered is that if you have a stdout output in your output section and you are processing a lot of documents, the load will go super high.

Thank you for the suggestion! I am not entirely convinced that it has anything to do with the heap space, though. The heap and the load in the screenshot both drop at the same time because the Logstash service was restarted. At other times I also saw the garbage collector running and freeing up heap space while the load still remained high.
I will have a look into your suggestion anyway and see if I can find anything out!

I typically run logstash with 200 MB of heap. If the heap is growing to 5 GB then I cannot think of any explanation other than a memory leak.

As I already assumed, the problem wasn't the heap itself. It was the syslog input, as described here:

An update of the syslog input plugin to version 3.7.0 solved the problem.
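
The update can be done with Logstash's plugin tool (path shown for a package install); Logstash needs a restart afterwards so the new plugin version is loaded:

# Check the currently installed version of the syslog input
/usr/share/logstash/bin/logstash-plugin list --verbose | grep syslog

# Update the plugin to the latest published version
/usr/share/logstash/bin/logstash-plugin update logstash-input-syslog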
