We are hosting an Elasticsearch cluster on AWS Kubernetes. The cluster has 5 master nodes, 37 data nodes, and 10 coordinating nodes.
We request/limit [memory: 64G][cpu: 25] for each Kubernetes container and give 30G of memory to each data node (sketched below).
ES version: 6.8.1
JVM version: 1.8.0_272
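For context, the memory split looks roughly like this (assuming the 30G refers to the JVM heap set via ES_JAVA_OPTS; the exact flags in our manifests may differ):

# container: requests = limits = 64G memory / 25 CPU
# JVM heap pinned at 30G, leaving ~34G for off-heap use and the OS page cache
ES_JAVA_OPTS="-Xms30g -Xmx30g"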
Recently, we found that all the data nodes have OS CPU usage stuck at 100% while the Elasticsearch process CPU stays below 5%. See the node stats below.
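The output below comes from the nodes stats API, roughly:

curl -s 'http://<es-host>:9200/_nodes/Wh_7IWLUQnu4t30cai2agg/stats/os?pretty'

(<es-host> is a placeholder; the name/host fields are redacted.)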
{
  "_nodes": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "cluster_name": "metrics",
  "nodes": {
    "Wh_7IWLUQnu4t30cai2agg": {
      "timestamp": 1626321379127,
      "name": "****",
      "transport_address": "10.48.101.47:9300",
      "host": "****",
      "ip": "10.48.101.47:9300",
      "roles": [
        "data"
      ],
      "attributes": {
        "ml.machine_memory": "68719476736",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "ml.enabled": "true"
      },
      "os": {
        "timestamp": 1626321379128,
        "cpu": {
          "percent": 100,
          "load_average": {
            "1m": 22.3,
            "5m": 13.1,
            "15m": 6.79
          }
        },
        "mem": {
          "total_in_bytes": 68719476736,
          "free_in_bytes": 40960,
          "used_in_bytes": 68719435776,
          "free_percent": 0,
          "used_percent": 100
        },
        "swap": {
          "total_in_bytes": 0,
          "free_in_bytes": 0,
          "used_in_bytes": 0
        }
      }
    }
  }
}
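The process-level figure quoted above comes from the same API; OS vs. process CPU can be compared directly with something like:

curl -s 'http://<es-host>:9200/_nodes/Wh_7IWLUQnu4t30cai2agg/stats/os,process?filter_path=nodes.*.os.cpu.percent,nodes.*.process.cpu.percent&pretty'

and, if top is available in the image, kubectl exec <data-pod> -- top -b -n 1 shows what is actually consuming CPU inside the pod (<data-pod> is a placeholder).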
Any idea how to debug this issue?