High system load causing performance issues


(Ant) #1

Hi

I'm noticing a latency spike on my Elasticsearch cluster whenever system interrupts go over 8k a second. My knowledge of what that means is limited to an hour of googling, and what I've come to realise is that I don't know enough to solve this issue.
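For anyone wanting to watch the interrupt rate at a finer grain than a monitoring poll, here is a minimal sketch that reads the total interrupt counter from the `intr` line of `/proc/stat` on Linux and turns two readings into a per-second rate. The function names and the one-second default interval are my own choices for illustration, not anything from a monitoring product.

```python
import time


def parse_intr_total(proc_stat_text):
    """Return the total interrupt count since boot from /proc/stat content.

    The first number on the 'intr' line is the sum of all interrupts.
    """
    for line in proc_stat_text.splitlines():
        if line.startswith("intr "):
            return int(line.split()[1])
    raise ValueError("no 'intr' line found")


def interrupts_per_second(prev_total, curr_total, seconds):
    """Convert two cumulative readings into a rate."""
    return (curr_total - prev_total) / seconds


def sample(interval=1.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_intr_total(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_intr_total(f.read())
    return interrupts_per_second(before, after, interval)
```

Calling `sample()` in a loop on one of the nodes should show whether the 8k/s figure is a sustained level or a short burst that a coarse poll averages out.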

The nodes each have 2 CPUs with 8 cores per CPU (16 cores total) and 64GB of RAM, running CentOS 7. I'm not seeing high CPU, though: system load went up to 25 at one point overnight, but CPU utilisation was only about 40%.
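To put the load figure in context, a quick sketch of the arithmetic: dividing the load average by the core count gives the average number of tasks per core. On Linux the load average counts tasks in uninterruptible sleep (usually waiting on I/O) as well as runnable ones, so a value above 1.0 per core alongside modest CPU utilisation can mean tasks are blocked rather than computing. The helper names below are made up for illustration.

```python
import os


def load_per_core(load_average, cores):
    """Average number of runnable or blocked tasks per core."""
    return load_average / cores


def current_load_per_core():
    """Same ratio for the local machine, using the 1-minute load average."""
    one_minute, _, _ = os.getloadavg()  # Unix only
    return load_per_core(one_minute, os.cpu_count())


# Figures from the post: a load of 25 on a 16-core node.
print(load_per_core(25, 16))  # 1.5625
```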

I did think it might be disk I/O, but I've seen shorter slowdowns (a minute or so) with negligible I/O that still triggered spikes in system interrupts over 8k/s. Last night's instance was a large event lasting an hour, and the recorded disk I/O went far higher than in any of those earlier slowdowns. I had previously thought the disks might be a bit underpowered and causing the issue, but the interrupt spikes now seem the more likely culprit, as they're the only thing that is consistent across the slowdowns.

Some of the reading I found suggested that high interrupt rates can occur when a process is split across CPUs and tries to use RAM attached to the other CPU. As I said, the server has 64GB of RAM, so 32GB is on one CPU and 32GB is on the other. 31GB has been allocated to the JVM heap.

The monitoring product I use identified the additional CPU activity (it normally runs at 20% of total but spiked to 40% during this slowdown) as being in the "wait" category, which seems to lend some weight to this tying back to NUMA.
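One way to cross-check the "wait" figure directly on a node is to compute it from the aggregate `cpu` line of `/proc/stat`. This sketch assumes the monitoring tool's "wait" category corresponds to the kernel's `iowait` counter, which is time a CPU spent idle while disk I/O was outstanding; that mapping is an assumption on my part. The function names are illustrative.

```python
CPU_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")


def parse_cpu_times(proc_stat_text):
    """Parse the aggregate 'cpu' line (first line of /proc/stat) into a dict."""
    fields = proc_stat_text.splitlines()[0].split()
    if fields[0] != "cpu":
        raise ValueError("first line is not the aggregate 'cpu' line")
    return dict(zip(CPU_FIELDS, map(int, fields[1:9])))


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0
```

Feeding it two reads of `/proc/stat` taken a few seconds apart during a slowdown gives a number that can be compared with the monitoring graph.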

I wondered if anyone else has experience with this and can point me in the right direction to resolve it, saving me from going down a few rabbit holes?


(Mark Walkom) #2

Which is what exactly? It'll help us understand what other things you may not be monitoring.

That's a kinda weird metric to highlight, does it correlate exactly with the slow down?

What do the Elasticsearch logs show?
What version are you on?


(Ant) #3

Hi Mark,

The product is called Observium; it polls every 5 minutes and gathers stats from the box.

The more normal metric to highlight is system load, which did also spike. Normally it's about 5, but in the smaller instances of the issue I've seen it go to 10, and when the system came under load the other day (we ultimately attributed it to the volume of traffic) it went up to 25. Trying to dig into what was causing that system load is what brought me to the system interrupts, as disk and CPU didn't show what I'd conventionally call high loads.

There are a few mentions in the logs of

Caused by: java.lang.IllegalStateException: [nested] nested object under path [xyz] is not of nested type

but this isn't constant, and there are only 5 instances of it in the logs for the hour the system was under load.

I'm currently running Elasticsearch 5.4.3.


(Ant) #4

This is the best lead I can find as to what is going on, although it's not Elasticsearch-specific. As the server has 64GB of RAM, odds are that some of the JVM heap (31GB) is held in the RAM associated with one CPU and the rest in the RAM of the other. As such (from what I've read), if a thread is scheduled on the CPU that doesn't hold the parts of the heap it requires, it will need to fire a system interrupt to pass the data to that CPU's local RAM pool, which slows things down.
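A rough way to test the NUMA theory is to look at the per-node allocation counters Linux exposes under `/sys/devices/system/node/node*/numastat`. The sketch below parses those files and reports what fraction of a node's page allocations were `numa_miss` (pages placed on this node although another node was preferred). Note that these counters describe where memory was allocated, not how often it is accessed remotely, so they are only an indirect hint. The function names are my own.

```python
import glob


def parse_numastat(text):
    """Parse 'name value' lines from a node's numastat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def miss_ratio(stats):
    """Fraction of allocations on this node that were numa_miss."""
    total = stats["numa_hit"] + stats["numa_miss"]
    return stats["numa_miss"] / total if total else 0.0


def read_all_nodes(pattern="/sys/devices/system/node/node*/numastat"):
    """Return {node_name: stats} for every NUMA node found (Linux only)."""
    result = {}
    for path in sorted(glob.glob(pattern)):
        node = path.split("/")[-2]  # e.g. 'node0'
        with open(path) as f:
            result[node] = parse_numastat(f.read())
    return result
```

On a two-socket box, `read_all_nodes()` should return entries for `node0` and `node1`; comparing `miss_ratio` before and after a slowdown would show whether off-node allocation grows during the event.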

That said, I'm a little out of my depth on this one, so my understanding of the situation may be flawed.

