Data node with high OS CPU usage but low process CPU usage in AWS Kubernetes

We are hosting an Elasticsearch cluster on Kubernetes in AWS. The cluster has 5 master nodes, 37 data nodes and 10 coordinating nodes.

We request/limit [memory: 64G] [cpu: 25] for the Kubernetes container and give 30G of memory to the data node.

ES version: 6.8.1
JVM version: 1.8.0_272

Recently, we found that all the data nodes have OS CPU usage stuck at 100% while the process CPU usage is below 5%. See the node stats below.

{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "metrics",
  "nodes": {
    "Wh_7IWLUQnu4t30cai2agg": {
      "timestamp": 1626321379127,
      "name": "****",
      "transport_address": "10.48.101.47:9300",
      "host": "****",
      "ip": "10.48.101.47:9300",
      "roles": [
        "data"
      ],
      "attributes": {
        "ml.machine_memory": "68719476736",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "os": {
        "timestamp": 1626321379128,
        "cpu": {
          "percent": 100,
          "load_average": {
            "1m": 22.3,
            "5m": 13.1,
            "15m": 6.79
          }
        },
        "mem": {
          "total_in_bytes": 68719476736,
          "free_in_bytes": 40960,
          "used_in_bytes": 68719435776,
          "free_percent": 0,
          "used_percent": 100
        },
        "swap": {
          "total_in_bytes": 0,
          "free_in_bytes": 0,
          "used_in_bytes": 0
        }
      }
    }
  }
}
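To check whether every data node shows the same pattern, the comparison can be scripted against the node stats response. The following is a minimal sketch: it assumes the response also contains the `process.cpu.percent` field (not shown in the excerpt above), and the thresholds, the node name `data-0`, and the process value of 4 in the sample are made up for illustration.

```python
import json


def find_cpu_mismatch(stats, os_min=90, process_max=10):
    """Return (name, os %, process %) for nodes whose OS CPU is high
    while the Elasticsearch process CPU is low."""
    flagged = []
    for node in stats.get("nodes", {}).values():
        os_pct = node.get("os", {}).get("cpu", {}).get("percent")
        proc_pct = node.get("process", {}).get("cpu", {}).get("percent")
        if os_pct is None or proc_pct is None:
            continue  # skip nodes that do not report both figures
        if os_pct >= os_min and proc_pct <= process_max:
            flagged.append((node.get("name"), os_pct, proc_pct))
    return flagged


# Hypothetical, trimmed-down node stats document (illustrative values only).
sample = json.loads("""
{
  "nodes": {
    "Wh_7IWLUQnu4t30cai2agg": {
      "name": "data-0",
      "os": {"cpu": {"percent": 100}},
      "process": {"cpu": {"percent": 4}}
    }
  }
}
""")

print(find_cpu_mismatch(sample))
```

In practice `stats` would be the parsed body of `GET /_nodes/stats/os,process`.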

Any idea how to debug this issue?

Try the hot threads API to figure out where the CPU is spent.
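As a rough sketch, the hot threads endpoint can be queried per node like this. The base URL `http://localhost:9200` is an assumption, and the helper names are hypothetical. `threads`, `interval` and `type` are parameters of the hot threads API; `type` accepts `cpu` (the default), `wait` and `block`, and the latter two may be more informative when process CPU is low.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, node_id=None, threads=3,
                    interval="500ms", thread_type="cpu"):
    """Build the hot threads URL for one node, or for all nodes if node_id is None."""
    if node_id is None:
        path = "/_nodes/hot_threads"
    else:
        path = f"/_nodes/{node_id}/hot_threads"
    query = urlencode({"threads": threads, "interval": interval, "type": thread_type})
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_hot_threads(base_url, **kwargs):
    """Fetch the plain-text hot threads report (requires a reachable cluster)."""
    with urlopen(hot_threads_url(base_url, **kwargs), timeout=30) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_hot_threads(...) against a real cluster.
    print(hot_threads_url("http://localhost:9200",
                          node_id="Wh_7IWLUQnu4t30cai2agg",
                          thread_type="wait"))
```

Running it a few times with different `type` values and comparing the reports gives a better picture than a single sample.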

Also, you should probably upgrade to a more recent Elasticsearch 6.8 version, if the version above was not a typo.

What type of storage are you using?

We are using the following storage:

parameters:
fsType: ext4
type: gp2

Thanks for the suggestion @spinscale

The hot threads output is either empty or shows something like 12~30% CPU usage by ***search.

We are considering upgrading. Any other suggestions?

How much data do you have in the cluster? What is the size of your EBS gp2 volumes?

Here is a sample of the allocation; the rest of the data nodes are similar.


shards  disk.indices  disk.used  disk.avail  disk.total  disk.percent  host  ip             node
56      231.1gb       239.5gb    768.2gb     1007.8gb    23            ****  10.48.113.97   ***
56      220gb         230.3gb    777.4gb     1007.8gb    22            ****  10.48.108.121  ***
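The volume size matters because gp2 performance scales with it: the baseline is 3 IOPS per GiB, with a floor of 100 and a cap of 16,000, and only volumes whose baseline is below 3,000 IOPS rely on burst credits. A small sketch of that arithmetic follows. The function names are hypothetical, and the 1024 GiB figure is an assumption inferred from the `disk.total` of about 1007.8gb, not a confirmed volume size.

```python
def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, clamped to the range [100, 16000]."""
    return int(min(max(100, 3 * size_gib), 16000))


def gp2_can_burst(size_gib):
    """gp2 volumes burst to 3,000 IOPS only while their baseline is below that."""
    return gp2_baseline_iops(size_gib) < 3000


# Assumed volume size of 1024 GiB (see the lead-in above), plus a smaller one for contrast.
for size in (1024, 200):
    print(size, gp2_baseline_iops(size), gp2_can_burst(size))
```

Under that assumption the volumes would have a steady baseline of roughly 3,000 IOPS and would not depend on burst credits, so exhausted credits would be a less likely explanation here than on small gp2 volumes.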