High CPU usage on warms caused by metricbeat

We've been troubleshooting an issue for several months that seems to be related to metricbeat. This weekend, within a few hours of starting metricbeat on our warm, hot, percolator, and client nodes (we do not run it on the masters), we saw a large increase in CPU usage on the warms.

This started at 8pm on a Friday, so we know it's not related to traffic increases. CPU usage on the warms continued to be bottlenecked at around 100% until the ES URL went down completely Sunday night, right as we requested some automated backups and moves from hot to warm.

Every time I've tried to start metricbeat in the last few weeks we've had an unexplainable outage with ES within a few days. The symptoms aren't quite the same each time. Sometimes we see an increase in CPU on our ingestion service instead, which causes a different type of outage, but CPU usage seems to be a common thread. Stopping metricbeat stabilizes the service every time. Is there a different way we can configure metricbeat that is less risky and can't cause outages?

We previously ran metricbeat on our masters, until it caused high CPU usage on the masters themselves; now we collect those stats from a node in our cluster that isn't running ES. This was working well for several months until now.

We are on ES, metricbeat, and Kibana 7.5.2. This is our /etc/metricbeat/metricbeat.yml, with some sensitive data replaced by XXXXX:

  output.elasticsearch:
    hosts: ["es-monitoring.XXXXX.com:9200"]

  metricbeat.config.modules:
    path: ${path.config}/modules.d/*.yml

  setup.template.overwrite: true
  setup.ilm.overwrite: true
  setup.ilm.policy_file: "/etc/metricbeat/metricbeat-ilm-policy.conf"

  setup.template.settings:
    index.number_of_shards: 1
    index.number_of_replicas: 0

  setup.kibana:
    host: "kibana.XXXXX.com:5601"
    username: XXXXX
    password: XXXXX

  metricbeat.modules:
  - module: elasticsearch
    metricsets:
      - ccr
      - enrich
      - cluster_stats
      - index
      - index_recovery
      - index_summary
      - ml_job
      - node_stats
      - shard
    hosts: ["http://localhost:9200"]
    period: 180s
    username: XXXXX
    password: XXXXX
    xpack.enabled: true

  setup.ilm.enabled: true
  setup.ilm.rollover_alias: "metricbeat"

The ES warm cluster URL is not at es-monitoring; that is just a single node running a backend cluster. For us, the warm nodes are at es1-url. Metricbeat only connects to these nodes as a local process running on them; the data is all shipped elsewhere. The node that's monitoring the masters has almost the same config, except hosts: is set directly to a list of master IPs and period is 120s.
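One lower-load variant of the module config above, offered as a sketch rather than a tested fix: as I understand it, with xpack.enabled: true the elasticsearch module collects the full stack-monitoring metricset regardless of the metricsets list, so trimming the list only takes effect with that flag off. If full Kibana Stack Monitoring isn't required, dropping the flag, keeping only node-level metricsets (the per-index ones like index, index_recovery, and shard scale with index count and tend to cost the most), and lengthening the period should reduce the query load on the cluster. The 300s period here is a hypothetical value, not a recommendation from the docs:

  - module: elasticsearch
    metricsets:
      - node_stats       # node-level stats are comparatively cheap
      - index_summary    # cluster-wide rollup, avoids per-index fan-out
    hosts: ["http://localhost:9200"]
    period: 300s         # hypothetical: halves the polling frequency vs 180s... adjust to taste
    username: XXXXX
    password: XXXXX

The trade-off is losing the per-index and ML detail in the monitoring cluster, but if the outages track metricbeat's collection load, a config like this narrows down which metricsets are responsible.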

Any tips on metricbeat would be very appreciated.

What metricbeat version are you using, and what OS are you running it on? Thank you!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.