Filebeat and Metricbeat processes cause high CPU usage in version 8.2.0

wisdomluo · June 15, 2022, 1:40am

top - 09:18:06 up 7 min,  1 user,  load average: 10.19, 8.81, 4.51
Tasks: 124 total,   3 running,  58 sleeping,   0 stopped,   1 zombie
%Cpu(s): 82.0 us,  7.7 sy,  0.0 ni, 10.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  1490700 total,    59020 free,   492076 used,   939604 buff/cache
KiB Swap:  3145724 total,  3142628 free,     3096 used.   916644 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                         
 1533 root      20   0 1415588 151740  78360 R  91.0 10.2   6:49.63 metricbeat                                      
 1542 root      20   0 1305404 118344  59992 S  91.0  7.9   6:44.16 filebeat                                        
 1569 root      20   0 1349988 141756  78336 S  90.4  9.5   6:44.70 metricbeat                                      
 1560 root      20   0 1297208 115584  59272 R  85.0  7.8   6:37.57 filebeat

The log files in the directory “/opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default” rotate quickly, and many log entries show "transport: authentication handshake failed: x509".

log file: /opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default/filebeat-20220615-909.ndjson

{"log.level":"error","@timestamp":"2022-06-15T09:19:25.026+0800","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":250},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-06-15T09:19:20+08:00 is before 2022-06-15T09:10:18Z\"","service.name":"filebeat","ecs.version":"1.6.0"}

log file: /opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default/metricbeat-20220615-887.ndjson

{"log.level":"error","@timestamp":"2022-06-15T09:19:24.184+0800","log.logger":"centralmgmt.fleet","log.origin":{"file.name":"management/manager.go","file.line":250},"message":"elastic-agent-client got error: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-06-15T09:19:19+08:00 is before 2022-06-15T09:10:18Z\"","service.name":"metricbeat","ecs.version":"1.6.0"}
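To gauge how much of each log file consists of this error, the matching lines can be counted. This is a minimal sketch, assuming a standard grep; the function name count_handshake_failures is made up for illustration, and the directory is the one quoted above.

```shell
#!/usr/bin/env bash
# Count the log lines in one file that contain the x509 handshake failure.
# -c prints the number of matching lines, -F treats the pattern as a literal string.
# Note: grep exits with status 1 when the count is 0, but it still prints "0".
count_handshake_failures() {
  grep -cF -- 'authentication handshake failed: x509' "$1"
}

# Print "<count><TAB><file>" for every ndjson log, if the directory exists on this host.
log_dir=/opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default
if [ -d "$log_dir" ]; then
  for f in "$log_dir"/*.ndjson; do
    [ -e "$f" ] || continue
    printf '%s\t%s\n' "$(count_handshake_failures "$f")" "$f"
  done
fi
```

If nearly every line in the rotated files matches, the Beats are most likely retrying the connection in a tight loop, which would fit the high CPU usage, although the thread does not confirm that.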

[root@apm-server-nginx-nodejs ~]# ps -ef | grep agent
root      1649     1  1 09:21 ?        00:00:14 /opt/Elastic/Agent/elastic-agent
root      1658  1649  0 09:21 ?        00:00:00 [elastic-agent] <defunct>
root      1666  1649  1 09:21 ?        00:00:11 /opt/Elastic/Agent/data/elastic-agent-b9a28a/install/metricbeat-8.2.0-linux-arm64/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E logging.level=warning -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/default/metricbeat/metricbeat.sock -E logging.files.path=/opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default -E logging.files.name=metricbeat -E logging.files.keepfiles=7 -E logging.files.permission=0640 -E logging.files.interval=1h -E path.data=/opt/Elastic/Agent/data/elastic-agent-b9a28a/run/default/metricbeat--8.2.0
root      1675  1649  0 09:21 ?        00:00:02 /opt/Elastic/Agent/data/elastic-agent-b9a28a/install/apm-server-8.2.0-linux-arm64/apm-server -E management.enabled=true -E gc_percent=${APMSERVER_GOGC:100} -E logging.level=warning -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/default/apm-server/apm-server.sock -E logging.files.path=/opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default -E logging.files.name=apm-server -E logging.files.keepfiles=7 -E logging.files.permission=0640 -E logging.files.interval=1h -E path.data=/opt/Elastic/Agent/data/elastic-agent-b9a28a/run/default/apm-server--8.2.0
root      1684  1649  0 09:21 ?        00:00:04 /opt/Elastic/Agent/data/elastic-agent-b9a28a/install/filebeat-8.2.0-linux-arm64/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E logging.level=warning -E http.enabled=true -E http.host=unix:///opt/Elastic/Agent/data/tmp/default/filebeat/filebeat.sock -E logging.files.path=/opt/Elastic/Agent/data/elastic-agent-b9a28a/logs/default -E logging.files.name=filebeat -E logging.files.keepfiles=7 -E logging.files.permission=0640 -E logging.files.interval=1h -E path.data=/opt/Elastic/Agent/data/elastic-agent-b9a28a/run/default/filebeat--8.2.0
root      1695  1649  2 09:21 ?        00:00:27 /opt/Elastic/Agent/data/elastic-agent-b9a28a/install/filebeat-8.2.0-linux-arm64/filebeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${FILEBEAT_GOGC:100} -E filebeat.config.modules.enabled=false -E logging.level=warning -E path.data=/opt/Elastic/Agent/data/elastic-agent-b9a28a/run/default/filebeat--8.2.0--36643631373035623733363936343635
root      1703  1649  0 09:21 ?        00:00:03 /opt/Elastic/Agent/data/elastic-agent-b9a28a/install/metricbeat-8.2.0-linux-arm64/metricbeat -E setup.ilm.enabled=false -E setup.template.enabled=false -E management.enabled=true -E logging.level=debug -E gc_percent=${METRICBEAT_GOGC:100} -E metricbeat.config.modules.enabled=false -E logging.level=warning -E path.data=/opt/Elastic/Agent/data/elastic-agent-b9a28a/run/default/metricbeat--8.2.0--36643631373035623733363936343635

I have set the SSL config for Elasticsearch, but the issue still happens. I don't know which service Filebeat and Metricbeat are failing to authenticate with. Could you take a look at this issue? Thanks so much!

Andrea_Spacca · June 15, 2022, 2:14am

Hello @wisdomluo, welcome to the Elastic community!

Indeed, there seems to be a timezone problem between the current time reported by the Elastic Agent client used by Filebeat and Metricbeat and the certificate time on Elastic Agent.

Not sure if this is the cause of the high CPU usage.

What system timezone do you have on the host running the agent?

Could you set it to UTC and restart the agent, to see whether the TLS handshake succeeds and the CPU usage drops at the same time?

wisdomluo · June 15, 2022, 2:28am

Thanks for the reply. The timezone is CST on all my hosts (VMs); should I set them to UTC? Actually, the issue (high CPU and the x509 authentication handshake failure) goes away once I restart the Elastic Agent process. But it happens again after I restart the VM, when the Elastic Agent process starts for the first time.

[root@apm-server-nginx-nodejs ~]# timedatectl
      Local time: Wed 2022-06-15 10:15:58 CST
  Universal time: Wed 2022-06-15 02:15:58 UTC
        RTC time: Wed 2022-06-15 02:15:59
       Time zone: Asia/Shanghai (CST, +0800)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: n/a

Andrea_Spacca · June 15, 2022, 7:02am

Yes, this should solve the problem with the TLS handshake.

Could you clarify your setup?
You are running the agent inside a VM, correct? If you restart the agent inside the VM while it is still running, the error is gone, but if you restart the VM, it appears again until you restart the agent inside the running VM, correct?

What kind of VM is it? Is there any init script in the VM that is executed upon restart?

Andrea_Spacca · June 15, 2022, 7:03am

What version of the agent are you running?

wisdomluo · June 15, 2022, 7:17am

I think it is 8.2.0; elastic-agent-8.2.0-linux-arm64.tar.gz is the package I installed for Elastic Agent.

wisdomluo · June 15, 2022, 7:23am

You are running the agent inside a VM, correct?
Yes.
If you restart the agent inside the VM while it is still running, the error is gone, but if you restart the VM, it appears again until you restart the agent inside the running VM, correct?
Yes.

What kind of VM is it? Is there any init script in the VM that is executed upon restart?

[root@apm-server-nginx-nodejs ~]# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (AltArch)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (AltArch)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7:server"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

[root@apm-server-nginx-nodejs ~]# uname -a
Linux apm-server-nginx-nodejs 5.11.12-300.el7.aarch64 #1 SMP Thu Aug 19 09:02:08 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
[root@apm-server-nginx-nodejs ~]#

There is no customized init script that runs upon restart.

VM1: Elasticsearch, Kibana, Elastic Agent (Fleet Server, Filebeat, Metricbeat)
VM2: Elastic Agent (APM Server, Filebeat, Metricbeat)
VM3: Elastic Agent (Filebeat, Metricbeat)

Andrea_Spacca · June 16, 2022, 3:04am

@wisdomluo
You can try either setting the VM to UTC or adding a cron job (or a similar solution) to restart the agent at @reboot.
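One way the restart-at-boot workaround could look is sketched below. It rests on assumptions the thread does not confirm: that the agent runs as a systemd unit named elastic-agent, that the first start fails because the clock is not yet correct, and that waiting for NTP synchronization before restarting is therefore useful. The script path in the crontab comment and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical root crontab entry to run this script once after each boot:
#   @reboot /usr/local/bin/restart-agent-after-sync.sh --run

# Read timedatectl output on stdin; succeed if it reports a synchronized clock.
# Older systemd prints "NTP synchronized: yes", newer prints "System clock synchronized: yes".
ntp_synchronized() {
  grep -Eq '^[[:space:]]*(NTP|System clock) synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Poll every 10 seconds (up to 30 tries), then restart the agent once the clock is in sync.
wait_then_restart() {
  local tries=30
  while [ "$tries" -gt 0 ]; do
    if LC_ALL=C timedatectl | ntp_synchronized; then
      systemctl restart elastic-agent   # unit name is an assumption
      return
    fi
    sleep 10
    tries=$((tries - 1))
  done
  echo "clock not synchronized after waiting; agent not restarted" >&2
  return 1
}

# Only act when explicitly asked, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
  wait_then_restart
fi
```

A plain `@reboot` restart without any wait might fire before the clock is corrected, which is why the sketch polls first; whether that is actually what happens on these VMs is not established here.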

Does the issue happen on each of the three VMs?

wisdomluo · June 16, 2022, 9:35am

Setting the timezone to UTC did not help; the issue still exists. Anyway, I plan to try with other VMs in a few days.
Currently, when the issue happens, the Elastic Agent is stuck in the "starting" status. Could you explain the logs below? All of the authentication failure log records contain "2022-06-16T10:13:44". That should be the time when my VM started up.

transport: authentication handshake failed: x509: certificate has expired or is not yet valid: current time 2022-06-16T10:16:51+08:00 is before 2022-06-16T10:13:44Z
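The two timestamps in that message carry different UTC offsets (+08:00 and Z), so they are easier to compare once both are converted to epoch seconds. This is a minimal sketch, assuming GNU date (for the -d option); the function name seconds_until_valid is made up for illustration.

```shell
#!/usr/bin/env bash
# Print how many seconds remain until the certificate's start of validity,
# given the "current time" and "is before" timestamps from the x509 error.
# Both inputs carry an explicit UTC offset, so the host timezone does not matter.
seconds_until_valid() {
  local now_epoch not_before_epoch
  now_epoch=$(date -u -d "$1" +%s) || return 1
  not_before_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( not_before_epoch - now_epoch ))
}

# Values copied from the log line above.
seconds_until_valid "2022-06-16T10:16:51+08:00" "2022-06-16T10:13:44Z"
```

For these values it prints 28613, about 7 h 57 min: 10:16:51+08:00 is 02:16:51 UTC, while the certificate only becomes valid at 10:13:44 UTC. The roughly eight-hour gap matches the +08:00 offset, which would be consistent with the certificate having been created while the clock held local time as if it were UTC, but nothing in the thread confirms that.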

wisdomluo · June 17, 2022, 1:13am

Yes, the issue happens on each of the three VMs. CPU usage is high, the log files full of authentication failures rotate very quickly, and the Elastic Agent is stuck in the "starting" status. Although this is not a production environment, I am worried that the error will cause high CPU usage and impact business services if I deploy the Elastic Agent to production.

system · July 15, 2022, 3:14am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.