I'm noticing a latency spike on my Elastic cluster whenever system interrupts go over 8k per second. What I know about interrupts is limited to the last hour of googling, and what I've come to realise is that I don't know enough to work out how to solve this.
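For anyone who wants to check the same figure directly on a node, here is a minimal sketch of how the interrupt rate could be sampled. It assumes the monitoring product's "system interrupts" number corresponds to the total on the `intr` line of /proc/stat, which has not been confirmed. The function names are made up for illustration.

```python
import time


def parse_intr_total(stat_text):
    """Return the cumulative interrupt count from /proc/stat contents.

    The first number on the "intr" line is the total number of
    interrupts serviced since boot.
    """
    for line in stat_text.splitlines():
        if line.startswith("intr "):
            return int(line.split()[1])
    raise ValueError("no 'intr' line found")


def interrupts_per_second(first_total, second_total, elapsed_seconds):
    """Convert two cumulative totals into a rate."""
    return (second_total - first_total) / elapsed_seconds


def sample_interrupt_rate(interval=1.0, path="/proc/stat"):
    """Read the counter twice, `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        first = parse_intr_total(handle.read())
    started = time.monotonic()
    time.sleep(interval)
    with open(path) as handle:
        second = parse_intr_total(handle.read())
    elapsed = time.monotonic() - started
    return interrupts_per_second(first, second, elapsed)
```

Calling `sample_interrupt_rate()` in a loop during a slowdown would show whether the node really crosses the 8k/s mark at the same moment latency rises.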
The nodes each have 2 CPUs with 8 cores per CPU (so 16 cores total) and 64GB of RAM, running CentOS 7. I'm not seeing high CPU though: my system load went up to 25 at one point overnight, but CPU usage was only about 40%.
I did think it might be disk I/O, but I've seen the same thing happen in much shorter slowdowns (a minute or so) with negligible I/O, and those also came with system interrupts spiking over 8k/s. Last night's instance was a large event lasting an hour, and the recorded disk I/O went far higher than in any of those earlier slowdowns. I had previously thought the disks might be a bit underpowered and causing the issue, but the interrupt spikes now seem the more likely culprit, as they're the only thing that is consistent across the slowdowns.
Some of the reading I found suggested that high interrupt counts can be caused by a process being split across CPUs and trying to use RAM attached to the other CPU (NUMA). As I said, each server has 64GB of RAM, so 32GB is attached to one CPU and 32GB to the other. 31GB has been allocated to the JVM heap.
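To make the 31GB versus 32GB arithmetic concrete, here is a rough sketch that reads the per-node memory totals Linux exposes under /sys/devices/system/node and works out how much of a node would be left over if the whole heap sat on it. It treats the 31GB heap as 31 GiB and ignores everything else that uses memory on the node (both are simplifying assumptions). The function names are made up for illustration.

```python
import glob


def parse_node_meminfo(text):
    """Parse one node's meminfo file into {field: value}.

    Lines look like "Node 0 MemTotal:       33456789 kB";
    memory fields are reported in kB.
    """
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "Node":
            values[parts[2].rstrip(":")] = int(parts[3])
    return values


def headroom_kb(heap_gib, node_total_kb):
    """kB left on a node if the whole heap were placed on it."""
    return node_total_kb - heap_gib * 1024 * 1024


def read_all_nodes(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {path: parsed fields} for every NUMA node (Linux only)."""
    nodes = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            nodes[path] = parse_node_meminfo(handle.read())
    return nodes
```

With exactly 32 GiB per node, `headroom_kb(31, 32 * 1024 * 1024)` leaves only 1 GiB for everything else on that node, so it seems plausible that some of the process's memory ends up on the other CPU's node.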
The monitoring product I use identified the additional CPU activity I saw (it normally runs at 20% of total but spiked to 40% during this slowdown) as being in the "wait" category, which seems to lend some weight to this tying back to NUMA.
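As a way of cross-checking that "wait" figure on the node itself, here is a small sketch that computes the iowait share between two readings of the aggregate `cpu` line in /proc/stat. It assumes the product's "wait" category is the kernel's iowait counter (the fifth value on that line), which may not be how the product defines it. The function names are made up for illustration.

```python
def parse_cpu_times(stat_text):
    """Return the counters from the aggregate "cpu" line of /proc/stat.

    Field order: user, nice, system, idle, iowait, irq, softirq,
    steal, guest, guest_nice.
    """
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [later - earlier for earlier, later in zip(before, after)]
    # Sum user..steal only; guest time is already counted inside user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

Taking one reading before and one during a slowdown and passing both to `iowait_percent` would show whether the extra 20% really is time spent waiting on I/O.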
I wondered if anyone else has experience with this and can point me in the right direction to resolve it, saving me from going down a few rabbit holes?