juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 6:40pm
1
Problem:
On a 10-core CPU, 64 GB RAM system running CentOS 7.9 under LXD container virtualisation,
Elasticsearch reports wrong CPU usage after upgrading to 7.16.3.
This issue wasn't present on 7.15.1.
We use this metric to track the health of the cluster, but now it shows 100% CPU use most of the time.
Other OS CPU tools confirm that this metric is not accurate.
Reproduce:
1. Upgrade to Elasticsearch 7.16.3.
2. Give work to the cluster to increase the CPU usage.
3. Obtain CPU usage via the CAT API: curl -uuser:pass https://localhost:9200/_cat/nodes?v
4. The CAT API will report a sustained 100% CPU use.
# curl -uuser:pass https://localhost:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.x.x.x 6 98 100 21.44 23.53 24.91 dhimrsw - nodeXXXX
10.x.x.x 48 100 100 18.33 17.49 16.85 dhimrsw - nodeXXXX
10.x.x.x 32 99 100 10.52 12.45 13.57 dhimrsw - nodeXXXX
10.x.x.x 21 98 100 29.13 33.93 34.91 dhimrsw * nodeXXXX
10.x.x.x 35 100 100 16.02 17.13 16.59 dhimrsw - nodeXXXX
10.x.x.x 47 97 100 32.28 30.24 30.42 dhimrsw - nodeXXXX
10.x.x.x 41 95 100 8.22 7.02 7.23 dhimrsw - nodeXXXX
10.x.x.x 19 99 100 14.26 13.40 13.54 dhimrsw - nodeXXXX
10.x.x.x 51 95 100 30.86 28.44 27.42 dhimrsw - nodeXXXX
10.x.x.x 4 96 100 24.88 22.34 21.95 dhimrsw - nodeXXXX
10.x.x.x 23 96 100 14.34 15.02 16.20 dhimrsw - nodeXXXX
10.x.x.x 19 95 100 22.19 20.98 20.42 dhimrsw - nodeXXXX
Most of the time each node will report 100% CPU.
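For anyone wanting to cross-check against "other OS CPU tools" as mentioned above, a minimal sketch (assuming a Linux host) that derives a host-wide CPU figure from /proc/stat, the same counters top and mpstat read:

```shell
# Read the aggregate "cpu" line from /proc/stat (the counters that
# top/mpstat use) and derive a rough busy percentage since boot.
# Real tools take two samples a second apart and diff them; this
# one-shot version is only a coarse sanity check.
read -r cpu user nice system idle rest < /proc/stat
busy=$(( user + nice + system ))
total=$(( busy + idle ))
echo "approx CPU busy since boot: $(( 100 * busy / total ))%"
```

On the affected nodes this host-level figure stays far below the sustained 100% the CAT API shows.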
We can see the effect on the metric after the patch and a rolling restart (the load in the cluster hasn't changed).
Are we aware of this issue? Is it happening to anyone else?
Thanks!
Juan
stephenb
(Stephen Brown)
January 17, 2022, 6:51pm
2
Hi @juan.domenech
What OS are you running on... is this Docker, or directly on the OS / a VM?
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 6:55pm
3
Hi!
This is CentOS 7.9 in a VM (10 cores and 64 GB RAM per node, with 12 nodes/VMs in the cluster).
stephenb
(Stephen Brown)
January 17, 2022, 7:19pm
4
A couple more questions:
Are you using the bundled JDK or your own?
Also curious what VM solution you are using... or is this AWS EC2, etc.?
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 7:25pm
5
No problem!
Your question gave me an idea. We also have older hypervisors running KVM (real VMs). I'm going to test 7.16.3 there to see if the underlying tech has something to do with this.
I'll be back!
juan.domenech
(Juan Domenech Fernandez)
January 17, 2022, 8:14pm
6
I can confirm that in a VM running a traditional virtualisation layer (KVM) this issue is not present. Elasticsearch 7.16.3 reports CPU usage correctly.
I'll update the original post accordingly.
This does not explain why this metric broke between versions but it is an important clue.
1 Like
stephenb
(Stephen Brown)
January 17, 2022, 8:33pm
7
This could perhaps be related to a bug we opened against OpenJDK, though that may not explain why it only appeared when you upgraded.
https://bugs.openjdk.java.net/browse/JDK-8248215
juan.domenech
(Juan Domenech Fernandez)
January 19, 2022, 5:21pm
8
I really don't know TBH.
But seeing this change between versions (with no changes on our OS side) makes me think it's a code change.
After a more detailed look at my graphs, I think there is a bad calculation somewhere:
a node that was reporting around 7% CPU use on 7.15.1 now reports around 70% on 7.16.3.
If we factor in that this node is a 10-core LXD container, it looks like the CPU metric is being multiplied by the number of cores and capped at 100.
I see some recent changes to OsProbe.java in that area and I wonder if @rory.hunter could give us some direction.
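The multiply-and-cap behaviour I'm describing would look something like this (a hypothetical illustration of the suspected miscalculation, not the actual OsProbe code):

```shell
# Hypothetical sketch of the suspected bug: take the real host-wide
# CPU percentage, multiply by the container's core count, cap at 100.
real_cpu=7    # what 7.15.1 reported (%)
cores=10      # cores in the LXD container
reported=$(( real_cpu * cores ))
if [ "$reported" -gt 100 ]; then reported=100; fi
echo "7.16.3 would report: ${reported}%"   # prints "7.16.3 would report: 70%"
```

That would also explain why busier nodes all pin at exactly 100: anything above ~10% real usage hits the cap.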
rory.hunter
9
My changes were to support cgroups v2. If the OS was using v2, then no metrics would have been available at all before those changes, which doesn't appear to be the case. The JVM bug seems relevant though.
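Since the cgroups version matters here, a quick way to check which hierarchy the container actually sees (my assumption about where to look on a systemd-era Linux host):

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
# "cgroup2fs" means the unified cgroups v2 hierarchy; "tmpfs" means v1.
stat -fc %T /sys/fs/cgroup/
```

On CentOS 7.9 this should report the v1 (tmpfs) layout, which is consistent with the cgroups v2 changes not being the culprit.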
juan.domenech
(Juan Domenech Fernandez)
January 25, 2022, 8:43pm
10
Yes, it looks like a bug, but I'm not sure it's the one mentioned earlier (thanks @stephenb!).
Between Elasticsearch versions the bundled Java only went from 17 to 17.0.1, so there doesn't seem to have been much opportunity for that type of bug to get in:
[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.15.1-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment Temurin-17+35 (build 17+35)
OpenJDK 64-Bit Server VM Temurin-17+35 (build 17+35, mixed mode, sharing)
[root@ ~]# rpm -qa|grep elasticsearch
elasticsearch-7.16.3-1.x86_64
[root@ ~]# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "17.0.1" 2021-10-19
OpenJDK Runtime Environment Temurin-17.0.1+12 (build 17.0.1+12)
OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (build 17.0.1+12, mixed mode, sharing)
Anyhow, let's wait a bit and see if someone else bumps into this (I'm afraid LXD containers are not very common).
Thanks!
Juan
system
(system)
Closed
February 22, 2022, 8:43pm
11
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.