Hi all, we are running an ES 7.1.1 cluster in production. Everything had been running fine until one day the CPU usage on one data node suddenly jumped to 90%+, and all queries hitting the index on that node timed out. After we restarted the ES process on that machine, everything went back to normal, and we could not reproduce the problem. This has now happened twice on the same cluster, and we have no idea when it will happen again. Could anyone tell me how to debug this problem?
Here is some information related to the ES cluster:
ES version: 7.1.1
Nodes: 3 master nodes and 34 data nodes, each with a 30g JVM heap and the CMS garbage collector. All machines use SSDs.
Flame graph: captured on that node while the CPU was high.
It does not seem to be related to any particular query. While the problem was occurring, any query could push the CPU to 90%+. After we restarted the ES process, we replayed the same queries and they did not trigger the problem again.
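Since a restart clears the state, one thing we could do next time is save the output of the nodes hot threads API (`GET _nodes/<node>/hot_threads`) from the affected node before restarting it. Below is a minimal sketch of that idea, not something we have run in production. It assumes the cluster is reachable over plain HTTP without authentication; the helper names `hot_threads_url` and `capture_hot_threads`, the base URL, and the node name used in the note below are all placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, node_name, threads=5, interval="500ms"):
    """Build the hot threads API URL for a single node (hypothetical helper)."""
    # threads: how many of the busiest threads to report
    # interval: sampling window used by Elasticsearch
    # type=cpu: rank threads by CPU time
    params = urlencode({"threads": threads, "interval": interval, "type": "cpu"})
    node = quote(node_name, safe="")
    return f"{base_url.rstrip('/')}/_nodes/{node}/hot_threads?{params}"


def capture_hot_threads(base_url, node_name, out_path, timeout=30):
    """Fetch the plain-text hot threads report and save it to out_path."""
    url = hot_threads_url(base_url, node_name)
    # Assumption: no TLS or authentication is required to reach the cluster.
    with urlopen(url, timeout=timeout) as resp:
        report = resp.read().decode("utf-8", errors="replace")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(report)
    return report
```

For example, `capture_hot_threads("http://localhost:9200", "data-node-12", "hot_threads.txt")` would write the report for a node named `data-node-12` to a local file, which could then be compared with the flame graph.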
Thanks for any help!