Elasticsearch version: 7.10.0 & 7.14.2 (but not reproduced in ES v7.6.1)
Plugins installed: None
JVM version: Bundled in official Elasticsearch docker image
OS version: Reproduced on CentOS 7.9 with kernel 3.10.0 and 5.4.155, as well as on Ubuntu 20.04 with kernel 5.4.0
Kubernetes version: v1.19.x & v1.21.5
Description of the problem including expected versus actual behavior:
Running an Elasticsearch 7.10.0+ cluster on Kubernetes (reproduced on two different K8s distributions) using ECK 1.8.0.
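For reference, here is a minimal sketch of the kind of ECK manifest we deploy (the name, node count, and resources below are illustrative placeholders, not our exact values); switching Elasticsearch versions for the tests described below only requires changing spec.version:

```sh
# Illustrative ECK manifest (name/count/resources are examples only).
# Reproducing across versions only requires changing spec.version.
kubectl apply -f - <<'EOF'
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: perf-test
spec:
  version: 7.10.0
  nodeSets:
  - name: data
    count: 2
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              cpu: 8
              memory: 16Gi
            limits:
              cpu: 8
              memory: 16Gi
EOF
```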
While ingesting documents into Elasticsearch and running performance tests, we noticed a high (and unusual) system CPU usage: between 20 and 30% while being CPU bound, with user CPU around 60% and very little I/O wait.
We have reproduced this high system CPU usage with ES v7.10.0, as well as with ES v7.14.2 to check if it was still present in newer releases.
The issue does not seem to be specific to the OS or kernel version, as described above; the Kubernetes version and the storage layer (CSI) do not matter either. We have used statically provisioned local PersistentVolumes as well as OpenEBS, and in all cases the high system CPU usage was present with Elasticsearch v7.10.0+.
However, with Elasticsearch on the same topology and the same hardware but running on Docker alone (without Kubernetes), the high system CPU usage does not appear, so it is not simply a containerized-ES issue.
Moreover, we did not notice this over a year ago, when we were using the Elasticsearch version current at the time, v7.6.1. We have just tested that old version again and, sure enough, we do not reproduce the high system CPU usage under the exact same conditions (same data, same hardware, same ES topology, same K8s cluster, same ECK; only the ES version changed in the manifest). We therefore believe that something changed in Elasticsearch between v7.6.1 and v7.10.0 that causes this high system CPU usage when running on top of Kubernetes.
Steps to reproduce:
- Deploy an ES v7.10.0+ cluster on Kubernetes with ECK and ingest enough documents to reach a CPU-bound state (high CPU usage on the Elasticsearch data nodes). We used esrally with the eventdata track (see the command sketch after this list), with one shard per vCore allocated to the ES data pods. The number of data nodes does not matter: we reproduced the issue on 2-node and 75-node clusters. Having dedicated master nodes or not does not matter either; we tested both scenarios.
- Check the CPU usage breakdown with something like dstat -lrvn and focus on the system CPU value.
- You should see the system CPU usage going above 20%.
- Now if you run the same test with ES v7.6.1, the system CPU usage will be much lower.
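For illustration, the load generation and the measurement look roughly like this (the target host is a placeholder for our environment, and the exact Rally flags depend on the Rally version):

```sh
# Drive ingestion with Rally's eventdata track
# (https://github.com/elastic/rally-eventdata-track).
git clone https://github.com/elastic/rally-eventdata-track.git
esrally race --track-path=./rally-eventdata-track \
  --target-hosts=es-data.example.com:9200 \
  --pipeline=benchmark-only

# On the ES data nodes, watch the CPU breakdown; the "sys" column is the one
# going above 20% on v7.10.0+ (mpstat from sysstat works as well):
dstat -lrvn 5
mpstat 5
```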
This high system CPU usage (20-30%) prevents us from reaching the ingestion rate we had last year.
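In case it helps triage, the kernel-side time can be attributed to specific syscalls/kernel functions with perf on a node running an ES data pod; a generic sketch, not output from our runs:

```sh
# Sample all CPUs for 30 seconds with call graphs, then summarize;
# kernel frames in the report show where the "sys" time is being spent.
sudo perf record -a -g -- sleep 30
sudo perf report --sort=comm,dso,symbol
```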