I have a small Elasticsearch cluster running on Kubernetes 1.18 in AKS. It has 3 ingest nodes, 3 master nodes and 4 data nodes. I do not have much data, only a few gigabytes spread over a dozen indices. It was set up using Helm 3 and the elastic chart 6.8.12.
When I run `GET /_cat/nodes?v&s=load_1m:desc`, all my data nodes report 100% `os.cpu`, while the load varies between 2 and 4:
```
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.244.15.72 48           100         100 3.52    3.06    2.65     di        -      elasticsearch-test-data-2
10.244.0.237 59           100         100 3.21    3.03    2.87     di        -      elasticsearch-test-data-0
10.244.1.18  48           99          100 3.17    3.22    3.32     di        -      elasticsearch-test-data-1
10.244.17.9  69           100         100 2.72    2.67    3.02     di        -      elasticsearch-test-data-3
10.244.17.8  55           66          3   2.72    2.67    3.02     mi        *      elasticsearch-test-master-2
10.244.17.5  37           84          26  2.72    2.67    3.02     i         -      elasticsearch-test-client-2
10.244.18.42 41           81          28  0.62    0.70    0.63     i         -      elasticsearch-test-client-0
10.244.18.10 66           64          1   0.62    0.70    0.63     mi        -      elasticsearch-test-master-0
10.244.19.12 31           79          26  0.48    0.30    0.35     i         -      elasticsearch-test-client-1
10.244.19.51 18           64          1   0.48    0.30    0.35     mi        -      elasticsearch-test-master-1
```
I have confirmed the load values are likely correct in several ways:

- running `top` inside the data-node pods and confirming it matches the reported load values, with the Java process accounting for most of it
- running `top` on the pods for the data nodes and confirming they are pretty much the same
- checking Application Insights reports on how much CPU my data nodes are using
I have configured the Kubernetes resources to request 6 CPUs and 4Gi RAM, with limits of 8 CPUs and 4Gi RAM. I run on A8 machines with 8 CPUs, and I have confirmed that each data node is on its own Kubernetes node with full access to the CPUs.
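For reference, the resource configuration described above would look roughly like this in the chart's values (a sketch; the exact field location depends on the Helm chart version):

```yaml
# Sketch of the data-node resources described above; the elastic Helm
# chart passes this through to the Elasticsearch container spec.
resources:
  requests:
    cpu: "6"
    memory: 4Gi
  limits:
    cpu: "8"
    memory: 4Gi
```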
I confirmed the OS view of the CPUs by checking `/proc/cpuinfo`, and I have also confirmed that the data nodes each have 6 processors configured. Monitoring my data nodes over the past week, I have seen them use between 2 and 7 CPUs depending on load.
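One way to cross-check the Kubernetes angle is to compare what `/proc/cpuinfo` shows with the pod's cgroup CPU quota: `/proc/cpuinfo` always lists the host's CPUs, while the cgroup quota is what actually limits the container. A minimal sketch of the arithmetic (the cgroup v1 file paths named in the comments, and the helper itself, are assumptions for illustration):

```python
# Derive the effective CPU limit a pod sees from its CFS quota and period.
# Inside a container these typically come from
# /sys/fs/cgroup/cpu/cpu.cfs_quota_us and cpu.cfs_period_us
# (paths vary with cgroup v2 and the container runtime).

def effective_cpus(quota_us: int, period_us: int, host_cpus: int) -> float:
    """CPUs the cgroup may actually use; a quota of -1 means 'no limit'."""
    if quota_us <= 0:
        return float(host_cpus)
    return quota_us / period_us

# With a Kubernetes CPU limit of 8 on an 8-CPU A8 node (the values from
# this question), the quota would typically be 800000us per 100000us:
print(effective_cpus(800_000, 100_000, 8))  # -> 8.0
```

If the quota-derived number disagrees with what the JVM believes it has, a cgroup-unaware CPU reading can easily report 100% while the process-level numbers stay sane.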
So all my information points in the direction that the `os.cpu` number is incorrect. When checking `/_cluster/stats/nodes/<node>`, I get the `process.cpu` value, and it corresponds well with the load values and with the process CPU usage seen inside the pod. When checking `/_nodes/<node>/stats`, I get both the `os.cpu` value (100) and the `process.cpu` value (which looks fine).
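To make the discrepancy easy to spot across all nodes, the two readings from the node stats can be diffed programmatically. A sketch, using a hand-written payload that mirrors the shape of the 6.8 `/_nodes/<node>/stats` response (`os.cpu.percent` and `process.cpu.percent`); in practice the JSON would come from the cluster, and the helper name is made up:

```python
import json

# Hand-written sample mimicking the /_nodes/<node>/stats response shape;
# only the fields relevant to the comparison are included.
sample = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "elasticsearch-test-data-2",
      "os":      {"cpu": {"percent": 100}},
      "process": {"cpu": {"percent": 44}}
    }
  }
}
""")

def cpu_mismatches(stats: dict, threshold: int = 30) -> list:
    """Return (name, os_pct, proc_pct) for nodes where the two diverge."""
    out = []
    for node in stats["nodes"].values():
        os_pct = node["os"]["cpu"]["percent"]
        proc_pct = node["process"]["cpu"]["percent"]
        if abs(os_pct - proc_pct) >= threshold:
            out.append((node["name"], os_pct, proc_pct))
    return out

print(cpu_mismatches(sample))  # -> [('elasticsearch-test-data-2', 100, 44)]
```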
When looking at hot threads or the thread pools, I see nothing that concerns me. There are of course some hot threads on some of the nodes, but nothing blocking or locking for any significant amount of time.
Is this perhaps a bug in Elasticsearch's reporting? Am I having issues with the OS view of the VM while running in Kubernetes? Could it be a problem that my data nodes are also ingest nodes? How can I continue troubleshooting and hopefully, at some point, sleep well? Or am I just reading this CPU value the wrong way?