We are hosting an Elasticsearch cluster on Kubernetes on AWS. The cluster has 5 master nodes, 37 data nodes, and 10 coordinating nodes.
We set the Kubernetes container request/limit to [memory: 64G] [cpu: 25] and allocate 30G of heap to each data node.
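To rule out a mismatch between the pod spec and what the container actually gets, we can check the cgroup limits from inside the pod. A minimal sketch, assuming cgroup v1 paths (the default for this kernel/JVM generation; the file locations differ under cgroup v2):

# Sketch: confirm the CPU/memory limits visible inside the container.
# Assumes cgroup v1 mount points; run inside the data-node pod.

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

quota = read_int("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")    # -1 means unlimited
period = read_int("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
mem_limit = read_int("/sys/fs/cgroup/memory/memory.limit_in_bytes")

if quota > 0:
    print("effective CPUs: %.1f" % (quota / float(period)))
print("memory limit: %.1f GiB" % (mem_limit / 2.0 ** 30))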
ES version: 6.8.1
JVM version: 1.8.0_272
Recently, we found that all of the data nodes have OS CPU usage stuck at 100% while the process CPU stays below 5%. See the node stats below (one data node shown; a query sketch follows the JSON).
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "metrics",
  "nodes": {
    "Wh_7IWLUQnu4t30cai2agg": {
      "timestamp": 1626321379127,
      "name": "****",
      "transport_address": "10.48.101.47:9300",
      "host": "****",
      "ip": "10.48.101.47:9300",
      "roles": [
        "data"
      ],
      "attributes": {
        "ml.machine_memory": "68719476736",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "os": {
        "timestamp": 1626321379128,
        "cpu": {
          "percent": 100,
          "load_average": {
            "1m": 22.3,
            "5m": 13.1,
            "15m": 6.79
          }
        },
        "mem": {
          "total_in_bytes": 68719476736,
          "free_in_bytes": 40960,
          "used_in_bytes": 68719435776,
          "free_percent": 0,
          "used_percent": 100
        },
        "swap": {
          "total_in_bytes": 0,
          "free_in_bytes": 0,
          "used_in_bytes": 0
        }
      }
    }
  }
}
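For reference, here is roughly how we pull these numbers from every data node, via the 6.x node stats API. A minimal sketch; http://localhost:9200 is a placeholder for our coordinating endpoint:

# Sketch: compare OS CPU vs process CPU across all data nodes.
# Uses GET _nodes/stats/os,process; replace localhost:9200 with a real endpoint.
import json
import urllib.request

url = "http://localhost:9200/_nodes/stats/os,process"
with urllib.request.urlopen(url) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    if "data" not in node.get("roles", []):
        continue
    os_cpu = node["os"]["cpu"]["percent"]
    proc_cpu = node["process"]["cpu"]["percent"]
    print("%s os=%d%% process=%d%%" % (node["name"], os_cpu, proc_cpu))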
Any ideas on how to debug this issue?