I have a small Elasticsearch cluster running on Kubernetes 1.18 in AKS. It has 3 ingest nodes, 3 master nodes and 4 data nodes. I do not have much data, only a few gigabytes spread over a dozen indices. It was set up using Helm 3 and the Elastic chart 6.8.12 [0].
When I run GET /_cat/nodes?v&s=load_1m:desc, you can see that all my data nodes report 100% os.cpu and that the load varies between 2 and 4:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.244.15.72 48 100 100 3.52 3.06 2.65 di - elasticsearch-test-data-2
10.244.0.237 59 100 100 3.21 3.03 2.87 di - elasticsearch-test-data-0
10.244.1.18 48 99 100 3.17 3.22 3.32 di - elasticsearch-test-data-1
10.244.17.9 69 100 100 2.72 2.67 3.02 di - elasticsearch-test-data-3
10.244.17.8 55 66 3 2.72 2.67 3.02 mi * elasticsearch-test-master-2
10.244.17.5 37 84 26 2.72 2.67 3.02 i - elasticsearch-test-client-2
10.244.18.42 41 81 28 0.62 0.70 0.63 i - elasticsearch-test-client-0
10.244.18.10 66 64 1 0.62 0.70 0.63 mi - elasticsearch-test-master-0
10.244.19.12 31 79 26 0.48 0.30 0.35 i - elasticsearch-test-client-1
10.244.19.51 18 64 1 0.48 0.30 0.35 mi - elasticsearch-test-master-1
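For reference, this is the same call run with curl from my workstation (assuming the HTTP port is reachable on localhost:9200, e.g. via kubectl port-forward), trimmed to just the CPU and load columns:

curl -s 'http://localhost:9200/_cat/nodes?v&s=load_1m:desc&h=name,node.role,cpu,load_1m,load_5m,load_15m'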
I have confirmed the load values are likely correct in several ways (a sketch of the commands follows the list):

- executing top inside the pods for the data nodes and confirming that the values match the reported load and that the java process accounts for pretty much all of it
- executing top on the pods for the data nodes and confirming the values are pretty much the same
- checking Application Insights reports on how much CPU my data nodes are using
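This is roughly how I ran the top checks from the first two bullets (pod name is from my setup; adjust name and namespace as needed):

kubectl exec elasticsearch-test-data-0 -- top -b -n 1 | head -n 15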
I have configured the Kubernetes resources to request 6 CPUs and 4Gi of RAM, with limits of 8 CPUs and 4Gi of RAM. I run on A8 machines with 8 CPUs, and I have confirmed that the data nodes each run on their own Kubernetes node with full access to the CPUs.
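To double-check that the requests and limits were actually applied, I look at the pod spec directly (again, pod name from my cluster):

kubectl get pod elasticsearch-test-data-0 -o jsonpath='{.spec.containers[0].resources}'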
I confirmed the OS view of the CPUs by checking /proc/cpuinfo. I have also confirmed that the data nodes each have 6 processors configured. Monitoring my data nodes over the past week, I have seen them use between 2 and 7 CPUs depending on load.
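This is roughly how I compare the OS view of the CPUs with the cgroup CPU quota from inside a data pod (cgroup v1 paths, which is what my nodes use; with a limit of 8 CPUs I would expect quota/period to come out as 800000/100000):

kubectl exec elasticsearch-test-data-0 -- grep -c ^processor /proc/cpuinfo
kubectl exec elasticsearch-test-data-0 -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us /sys/fs/cgroup/cpu/cpu.cfs_period_us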
So all my information points in the direction that the os.cpu number is incorrect. When checking /_cluster/stats/nodes/<node>, I get the process.cpu value, and it corresponds very well with the load values and with the process CPU value seen inside the pod. When checking /_nodes/<node>/stats, I get both the os.cpu value (100) and the process.cpu value (which looks fine).
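For the record, this is the call I use to pull both values for all nodes in one go (again assuming localhost:9200 is port-forwarded; filter_path just trims the output):

curl -s 'http://localhost:9200/_nodes/stats/os,process?filter_path=nodes.*.name,nodes.*.os.cpu,nodes.*.process.cpu'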
When looking at hot threads or the thread pools, I see nothing that bothers me. Of course there are some hot threads on some of the nodes, but nothing blocking or locking for any significant amount of time.
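These are the calls I used for that check (node name from my cluster; the thread pool columns just show active, queued and rejected tasks per pool):

curl -s 'http://localhost:9200/_nodes/elasticsearch-test-data-0/hot_threads?threads=5'
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'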
Is it perhaps a bug in Elasticsearch's reporting? Am I having issues with the OS view of the VM while running in Kubernetes? Can it be a problem that my data nodes are also ingest nodes? How can I continue troubleshooting and, hopefully, at some point sleep well? Am I just reading this CPU value the wrong way?
[0] https://github.com/elastic/helm-charts/blob/6.8.12/elasticsearch/README.md