I have a small Elasticsearch cluster running on Kubernetes 1.18 in AKS. It has 3 ingest nodes, 3 masters and 4 data nodes. I do not have much data, only a few gigabytes spread over a dozen indices. It was set up using Helm 3 and the elastic chart 6.8.12 [0].
When I run GET /_cat/nodes?v&s=load_1m:desc, you can see that all my data nodes report 100% os.cpu while the load varies between 2 and 4.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.244.15.72 48 100 100 3.52 3.06 2.65 di - elasticsearch-test-data-2
10.244.0.237 59 100 100 3.21 3.03 2.87 di - elasticsearch-test-data-0
10.244.1.18 48 99 100 3.17 3.22 3.32 di - elasticsearch-test-data-1
10.244.17.9 69 100 100 2.72 2.67 3.02 di - elasticsearch-test-data-3
10.244.17.8 55 66 3 2.72 2.67 3.02 mi * elasticsearch-test-master-2
10.244.17.5 37 84 26 2.72 2.67 3.02 i - elasticsearch-test-client-2
10.244.18.42 41 81 28 0.62 0.70 0.63 i - elasticsearch-test-client-0
10.244.18.10 66 64 1 0.62 0.70 0.63 mi - elasticsearch-test-master-0
10.244.19.12 31 79 26 0.48 0.30 0.35 i - elasticsearch-test-client-1
10.244.19.51 18 64 1 0.48 0.30 0.35 mi - elasticsearch-test-master-1
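For anyone who wants to repeat this check without eyeballing the cat output, here is a rough sketch of one way to poll it programmatically. It assumes the cluster answers on http://localhost:9200 without authentication, so adjust the URL and security for your own setup.

import json
import urllib.request

# Sketch: fetch /_cat/nodes as JSON and flag nodes where the reported cpu
# column is pegged near 100 while load_1m stays modest.
# Assumes the cluster is reachable on http://localhost:9200 without auth.
URL = "http://localhost:9200/_cat/nodes?format=json&h=name,node.role,cpu,load_1m"

with urllib.request.urlopen(URL) as resp:
    nodes = json.load(resp)

for node in nodes:
    cpu = int(node["cpu"])
    load_1m = float(node["load_1m"])
    flag = "  <-- pegged?" if cpu >= 95 and load_1m < 5 else ""
    print(f"{node['name']:35} role={node['node.role']:4} cpu={cpu:3}% load_1m={load_1m:.2f}{flag}")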
I have confirmed the load values are likely correct in several ways:
- executing top inside the data pods, confirming the values match the reported load and that the java process accounts for pretty much all of it (a sketch of doing the same check by hand is below this list)
- executing top on the Kubernetes nodes hosting the data pods, confirming the numbers are pretty much the same
- checking the Application Insights reports on how much CPU my data nodes are using
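To take top itself out of the picture, the same measurement can be done by hand from /proc inside a data pod. This is only a sketch: it takes the PID of the Elasticsearch java process as an argument and assumes a Linux container.

import os
import sys
import time

# Sketch: measure a process's CPU usage in "cores" the same way top does,
# by sampling utime+stime from /proc/<pid>/stat over an interval.
CLK_TCK = os.sysconf("SC_CLK_TCK")  # kernel clock ticks per second

def cpu_seconds(pid: int) -> float:
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()  # skip pid and (comm)
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 of stat
    return (utime + stime) / CLK_TCK

pid = int(sys.argv[1])  # PID of the java process
interval = 5.0
before = cpu_seconds(pid)
time.sleep(interval)
after = cpu_seconds(pid)
print(f"java process is using ~{(after - before) / interval:.2f} cores")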
I have configured the Kubernetes resources to request 6 CPUs and 4Gi RAM, with limits of 8 CPUs and 4Gi RAM. I run on A8 machines with 8 CPUs and have confirmed that each data node runs on its own Kubernetes node with full access to the CPUs.
I confirmed the OS view of the CPUs by checking /proc/cpuinfo, and I have also confirmed that the data nodes each have 6 processors configured. Monitoring the data nodes over the past week, I have seen them use between 2 and 7 CPUs depending on load.
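Another thing that is easy to check from inside a data pod is what CPU limit the cgroup actually exposes, since that is what a container-aware JDK bases its numbers on. A rough sketch, assuming cgroup v1 paths (which is what I would expect on this Kubernetes version):

# Sketch: read the CFS quota and period the container runs under
# (cgroup v1 paths; values are in microseconds). quota / period gives the
# number of CPUs the limit allows, which should match the 8-CPU limit.
def read_int(path: str) -> int:
    with open(path) as f:
        return int(f.read().strip())

quota = read_int("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")    # -1 means unlimited
period = read_int("/sys/fs/cgroup/cpu/cpu.cfs_period_us")

if quota > 0:
    print(f"quota={quota}us period={period}us -> limit of {quota / period:g} CPUs")
else:
    print("no CFS quota set on this cgroup")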
So all of my information points in the direction that the os.cpu number is incorrect. When checking /_cluster/stats/nodes/<node>, I get the process.cpu value, and it corresponds very well with the load values and with the process CPU seen from inside the pod. When checking /_nodes/<node>/stats I get both the os.cpu value (100) and the process.cpu value (which looks fine).
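To make the mismatch easy to see side by side, here is a sketch that pulls both values from the node stats API (again assuming http://localhost:9200 and no auth):

import json
import urllib.request

# Sketch: print os.cpu.percent next to process.cpu.percent for every node,
# so the discrepancy on the data nodes stands out.
URL = "http://localhost:9200/_nodes/stats/os,process"

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    os_cpu = node["os"]["cpu"]["percent"]
    proc_cpu = node["process"]["cpu"]["percent"]
    print(f"{node['name']:35} os.cpu={os_cpu:3}% process.cpu={proc_cpu:3}%")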
When looking at hot threads or the thread pools, I see nothing that bothers me. Of course there are some hot threads on some of the nodes, but nothing blocking or locking for any significant amount of time.
Is it perhaps a bug in Elasticsearch's reporting? Am I having issues with the OS view of the VM while running in Kubernetes? Could it be a problem that my data nodes are also ingest nodes? How can I continue troubleshooting and hopefully, at some point, sleep well? Am I just reading this CPU value the wrong way?
As for the information under /proc/self/... I will have to do some more digging to connect the dots. I have not looked at this information before and need to learn more to understand what I am looking at.
I parsed the stats manually and did the calculation by hand, and it always comes out above 2, which according to the code is then capped at 1, i.e. 100%.
It seems to me that the calculation is off: it does not respect the quota. It checks for quota > 0 but then does not actually use the quota. In my case the reported quota is 800000, and with a period of 100000 that works out to 8 CPUs. If I factor those 8 CPUs into the elapsedNanos, my CPU percentage comes back at 0.25-0.3 as expected.
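To spell the arithmetic out, here is a sketch with the real quota and period from my cgroup but a made-up usage figure (2.4 CPU-seconds over a 1-second window, roughly matching my load values):

# Sketch of the arithmetic with illustrative numbers. quota_us and period_us
# are the real values from my cgroup; usage_nanos is made up to represent
# ~2.4 cores of work during a 1 second sampling window.
quota_us = 800_000           # cpu.cfs_quota_us
period_us = 100_000          # cpu.cfs_period_us
cpus = quota_us / period_us  # = 8.0 CPUs allowed by the quota

elapsed_nanos = 1_000_000_000  # 1 second of wall clock
usage_nanos = 2_400_000_000    # ~2.4 CPUs' worth of work in that second

naive = usage_nanos / elapsed_nanos               # = 2.4, clamped to 1.0 -> 100%
corrected = usage_nanos / (elapsed_nanos * cpus)  # = 0.3 -> 30%, matches reality

print(f"naive ratio       = {naive:.2f} (clamped to 1.0 -> reported as 100%)")
print(f"quota-aware ratio = {corrected:.2f}")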
I verified this number with a CPU usage script I've used in the past [0].
So there seems to be a mismatch between what the JDK expects the system to report and what is actually reported. Any thoughts? A bug in the OS or a bug in the JDK?
As it's outside Elasticsearch, your guess is as good as mine I'm afraid. It could also be something specific to AKS, since I imagine this is using a somewhat customised OS too.