CPU usage metric via CAT API misbehaving after upgrading to 7.16.3 (running in LXD)

Problem:

  • 10-core CPU, 64 GB RAM nodes running CentOS 7.9 under LXD container virtualisation
  • Elasticsearch reports the wrong CPU usage after upgrading to 7.16.3
  • This issue wasn't present on 7.15.1
  • We use this metric to track the health of the cluster, but now it shows 100% CPU use most of the time
  • We have confirmed with other OS CPU tools that this figure is not accurate (see the cross-check sketch just below)
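
As a rough illustration of the kind of cross-check we ran (a minimal sketch; mpstat comes from the sysstat package and the numbers it prints are the OS's view, not Elasticsearch's):

# Inside the container, compare against what the OS itself reports
mpstat 5 1            # average CPU utilisation over a 5-second window
top -bn1 | head -n 5  # load averages plus the %Cpu(s) summary line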

Reproduce:

  • Upgrade to Elasticsearch 7.16.3
  • Put some load on the cluster to increase CPU usage
  • Obtain the CPU usage via the CAT API: curl -uuser:pass https://localhost:9200/_cat/nodes?v
  • The CAT API will report a sustained 100% CPU usage
# curl -uuser:pass https://localhost:9200/_cat/nodes?v
ip     heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.x.x.x          6          98 100   21.44   23.53    24.91 dhimrsw   -      nodeXXXX
10.x.x.x         48         100 100   18.33   17.49    16.85 dhimrsw   -      nodeXXXX
10.x.x.x         32          99 100   10.52   12.45    13.57 dhimrsw   -      nodeXXXX
10.x.x.x         21          98 100   29.13   33.93    34.91 dhimrsw   *      nodeXXXX
10.x.x.x         35         100 100   16.02   17.13    16.59 dhimrsw   -      nodeXXXX
10.x.x.x         47          97 100   32.28   30.24    30.42 dhimrsw   -      nodeXXXX
10.x.x.x         41          95 100    8.22    7.02     7.23 dhimrsw   -      nodeXXXX
10.x.x.x         19          99 100   14.26   13.40    13.54 dhimrsw   -      nodeXXXX
10.x.x.x         51          95 100   30.86   28.44    27.42 dhimrsw   -      nodeXXXX
10.x.x.x          4          96 100   24.88   22.34    21.95 dhimrsw   -      nodeXXXX
10.x.x.x         23          96 100   14.34   15.02    16.20 dhimrsw   -      nodeXXXX
10.x.x.x         19          95 100   22.19   20.98    20.42 dhimrsw   -      nodeXXXX

Most of the time each node will report 100% CPU.
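
For anyone who wants to see the raw values behind that column, the nodes stats API exposes the OS section that _cat/nodes reads from; a minimal sketch (placeholder credentials, same as above):

# os.cpu.percent is the value shown in the _cat/nodes "cpu" column;
# on Linux the response should also include an os.cgroup section with the raw counters
curl -s -uuser:pass "https://localhost:9200/_nodes/_local/stats/os?pretty"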

We can see the effect on the metric in our graphs after the patch and a rolling restart, even though the load on the cluster hasn't changed.

Are we aware of this issue? Is it happening to anyone else?

Thanks!
Juan

Hi @juan.domenech

What OS are you running on? Is this Docker, or directly on the OS / a VM?

Hi!

This is CentOS 7.9 in a VM (10 cores and 64 GB RAM per node, 12 nodes/VMs in the cluster).

A couple more questions:

Are you using the bundled JDK or your own?

Also curious what VM solution you are using... or is this AWS EC2, etc.?

No problem!

Your question gave me an idea. We also have older hypervisors running KVM (real VMs). I'm going to test 7.16.3 there to see if the underlying tech has something to do with this.

I'll be back!

I can confirm that this issue is not present in a VM running a traditional virtualisation layer (KVM): Elasticsearch 7.16.3 reports CPU usage correctly there.
I'll update the original post accordingly.

This does not explain why the metric broke between versions, but it is an important clue.
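
For anyone trying to reproduce this, here is a quick sketch of how to tell which virtualisation layer a node is under (assuming systemd is available, as it is on CentOS 7):

# Prints "kvm" on our KVM guests and "lxc" inside the LXD containers
systemd-detect-virt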


This could perhaps be related to a bug we opened against OpenJDK, though that may not explain why it only showed up when you upgraded.

https://bugs.openjdk.java.net/browse/JDK-8248215

I really don't know, TBH.
But seeing this change between versions (without any changes on our OS) makes me think of a code change.

After a more detailed look at my graphs, I think there is a bad calculation somewhere:

A node that was reporting around 7% CPU use on 7.15.1 now reports around 70% with 7.16.3.
If we factor in that this node is a 10-core LXD container, it looks like the CPU metric is being multiplied by the number of cores and capped at 100 (quick arithmetic sketch below).
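
A back-of-the-envelope check of that theory (the multiply-and-cap behaviour is my assumption, not something confirmed from the code):

# Hypothesis: reported = min(100, actual * cores), for the 10-core node above
actual=7; cores=10
reported=$(( actual * cores ))
(( reported > 100 )) && reported=100
echo "${reported}%"   # prints 70%, which matches what 7.16.3 shows for this node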

I see some recent changes to OsProbe.java in that area and I wonder if @rory.hunter could point us in the right direction :slight_smile:

My changes were to support cgroups v2. If the OS were using v2, then no metrics would have been available at all before those changes, which doesn't appear to be the case here. The JVM bug does seem relevant, though.
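
For reference, a quick way to check which cgroup version the container is actually exposing (a sketch; CentOS 7 is normally on v1):

# "tmpfs" means the cgroups v1 hierarchy is mounted; "cgroup2fs" would mean v2
stat -fc %T /sys/fs/cgroup/
# Under v1 you should see the per-controller directories (cpu, cpuacct, memory, ...)
ls /sys/fs/cgroup/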

Yes, it looks like a bug, but I'm not sure it's the one mentioned earlier (thanks @stephenb!).

Between Elasticsearch versions the bundled Java went from 17 to 17.0.1, so it doesn't look like there was much opportunity for that type of bug to get in:

[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.15.1-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment Temurin-17+35 (build 17+35)
OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode, sharing)
[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.16.3-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
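
For completeness, the running nodes can also confirm which JDK they're on; a sketch, assuming the nodes info jvm section still exposes the using_bundled_jdk flag in 7.x (placeholder credentials):

# filter_path just trims the response down to the JVM version and bundled-JDK flag
curl -s -uuser:pass "https://localhost:9200/_nodes/jvm?filter_path=nodes.*.jvm.version,nodes.*.jvm.using_bundled_jdk&pretty"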

Anyhow, let's wait a bit and see if someone else bumps into this (I'm afraid LXD containers are not very common).

Thanks!
Juan

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.