juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 6:40pm
1
Problem:
On a 10-core CPU, 64 GB RAM system running CentOS 7.9 under LXD container virtualisation,
Elasticsearch reports wrong CPU usage after upgrading to 7.16.3.
This issue wasn't present on 7.15.1.
We use this metric to track the health of the cluster, but now it shows 100% CPU use most of the time.
Other OS CPU tools confirm that this metric is not accurate.
Reproduce:
1. Upgrade to Elasticsearch 7.16.3.
2. Give work to the cluster to increase the CPU usage.
3. Obtain CPU usage via the CAT API: curl -uuser:pass https://localhost:9200/_cat/nodes?v
4. The CAT API will report a sustained 100% CPU use.
# curl -uuser:pass https://localhost:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.x.x.x 6 98 100 21.44 23.53 24.91 dhimrsw - nodeXXXX
10.x.x.x 48 100 100 18.33 17.49 16.85 dhimrsw - nodeXXXX
10.x.x.x 32 99 100 10.52 12.45 13.57 dhimrsw - nodeXXXX
10.x.x.x 21 98 100 29.13 33.93 34.91 dhimrsw * nodeXXXX
10.x.x.x 35 100 100 16.02 17.13 16.59 dhimrsw - nodeXXXX
10.x.x.x 47 97 100 32.28 30.24 30.42 dhimrsw - nodeXXXX
10.x.x.x 41 95 100 8.22 7.02 7.23 dhimrsw - nodeXXXX
10.x.x.x 19 99 100 14.26 13.40 13.54 dhimrsw - nodeXXXX
10.x.x.x 51 95 100 30.86 28.44 27.42 dhimrsw - nodeXXXX
10.x.x.x 4 96 100 24.88 22.34 21.95 dhimrsw - nodeXXXX
10.x.x.x 23 96 100 14.34 15.02 16.20 dhimrsw - nodeXXXX
10.x.x.x 19 95 100 22.19 20.98 20.42 dhimrsw - nodeXXXX
Most of the time each node will report 100% CPU.
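For anyone wanting to cross-check against "other OS CPU tools" as mentioned above, a minimal sketch (assuming a Linux host) that derives a host-wide CPU figure from /proc/stat, the same counters top and mpstat read:

```shell
# Read the aggregate "cpu" line from /proc/stat (the counters that
# top/mpstat use) and derive a rough busy percentage since boot.
# Real tools take two samples a second apart and diff them; this
# one-shot version is only a coarse sanity check.
read -r cpu user nice system idle rest < /proc/stat
busy=$(( user + nice + system ))
total=$(( busy + idle ))
echo "approx CPU busy since boot: $(( 100 * busy / total ))%"
```

On the affected nodes this host-level figure stays far below the sustained 100% the CAT API shows.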
We can see the effect on the metric after the patch and a rolling restart (the load in the cluster hasn't changed).
Are we aware of this issue? Is it happening to anyone else?
Thanks!
Juan
stephenb
(Stephen Brown)
January 17, 2022, 6:51pm
2
Hi @juan.domenech
What OS are you running on... is this Docker, or directly on the OS / a VM?
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 6:55pm
3
Hi!
This is CentOS 7.9 in a VM (10 cores and 64 GB RAM per node, with 12 nodes/VMs in the cluster).
stephenb
(Stephen Brown)
January 17, 2022, 7:19pm
4
A couple more questions:
Are you using the bundled JDK or your own?
Also curious what VM solution you are using... or is this AWS EC2, etc.?
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 7:25pm
5
No problem!
Your question gave me an idea. We also have older hypervisors running KVM (real VMs). I'm going to test 7.16.3 there to see if the underlying tech has something to do with this.
I'll be back!
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 8:14pm
6
I can confirm that in a VM running a traditional virtualisation layer (KVM) this issue is not present. Elasticsearch 7.16.3 reports CPU usage correctly.
I'll update the original post accordingly.
This does not explain why this metric broke between versions but it is an important clue.
1 Like
stephenb
(Stephen Brown)
January 17, 2022, 8:33pm
7
This could perhaps be related to a bug we opened against OpenJDK, though that may not explain why it only appeared when you upgraded.
https://bugs.openjdk.java.net/browse/JDK-8248215
juan.domenech
(Juan Domenech Fernandez)
January 19, 2022, 5:21pm
8
I really don't know TBH.
But seeing this change between versions (with no changes on our OS side) makes me think it's a code change.
After a more detailed look at my graphs, I think there is a bad calculation somewhere:
a node that was reporting around 7% CPU use on 7.15.1 now reports around 70% on 7.16.3.
If we factor in that this node is a 10-core LXD container, it looks like the CPU metric is being multiplied by the number of cores and capped at 100.
I see some recent changes to OsProbe.java in that area and I wonder if @rory.hunter could give us some direction.
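The multiply-and-cap behaviour I'm describing would look something like this (a hypothetical illustration of the suspected miscalculation, not the actual OsProbe code):

```shell
# Hypothetical sketch of the suspected bug: take the real host-wide
# CPU percentage, multiply by the container's core count, cap at 100.
real_cpu=7    # what 7.15.1 reported (%)
cores=10      # cores in the LXD container
reported=$(( real_cpu * cores ))
if [ "$reported" -gt 100 ]; then reported=100; fi
echo "7.16.3 would report: ${reported}%"   # prints "7.16.3 would report: 70%"
```

That would also explain why busier nodes all pin at exactly 100: anything above ~10% real usage hits the cap.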
rory.hunter
9
My changes were to support cgroups v2. If the OS was using v2, then no metrics would have been available at all before those changes, which doesn't appear to be the case. The JVM bug seems relevant though.
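Since the cgroups version matters here, a quick way to check which hierarchy the container actually sees (my assumption about where to look on a systemd-era Linux host):

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
# "cgroup2fs" means the unified cgroups v2 hierarchy; "tmpfs" means v1.
stat -fc %T /sys/fs/cgroup/
```

On CentOS 7.9 this should report the v1 (tmpfs) layout, which is consistent with the cgroups v2 changes not being the culprit.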
juan.domenech
(Juan Domenech Fernandez)
January 25, 2022, 8:43pm
10
Yes, it looks like a bug, but I'm not sure it's the one mentioned earlier (thanks @stephenb!).
Between Elasticsearch versions the bundled Java only went from 17 to 17.0.1, so there doesn't seem to have been much opportunity for that type of bug to get in:
[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.15.1-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment Temurin-17+35 (build 17+35)
OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode, sharing)
[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.16.3-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
Anyhow, let's wait a bit and see if someone else bumps into this (I'm afraid LXD containers are not very common).
Thanks!
Juan
system
(system)
Closed
February 22, 2022, 8:43pm
11
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.