Interesting CPU load (without actual traffic load)


(jacque74) #1

Hello, every once in a while we get into a state where one of our
servers reports high user and system CPU, as indicated on this graph:

http://bit.ly/yAQyPe

As you can tell, the rest of the cluster is pretty much idle, while
img699 is continuously hot on CPU:

top - 14:28:46 up 737 days, 18:21, 1 user, load average: 12.03, 10.37, 10.25
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 34.9%us, 33.9%sy, 0.0%ni, 18.8%id, 11.7%wa, 0.2%hi, 0.5%si, 0.0%st
Mem: 16472372k total, 16387968k used, 84404k free, 5952k buffers
Swap: 9775544k total, 5632k used, 9769912k free, 6111504k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8625 root 20 0 9113m 8.4g 10m S 513.6 53.6 9021:31 java

Please see the attached jstack output:

http://pastebin.com/HFUAn6ra

I am not really sure what it's doing: all of the health status
indicators are idle, there is no merge or flush in progress, and left
alone, this server will stay hot for days. The only way to resolve this
is to restart the process.

Shay, please let me know what you think.

-Jack
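
For readers facing the same symptom: the top output above shows the whole java process at 513.6% CPU, but not which threads are busy. On Linux, `top -H -p <pid>` lists per-thread CPU with decimal thread IDs, and jstack prints the same ID in hex as `nid=0x...`. The sketch below matches the two. The function names are hypothetical helpers, and the sample dump in the usage note is made up for illustration; it is not taken from the pastebin linked above.

```python
import re


def tid_to_nid(tid: int) -> str:
    """Convert a decimal Linux thread ID (as shown by `top -H`)
    to the hex form that jstack prints after `nid=`."""
    return hex(tid)


def find_thread(jstack_text: str, tid: int):
    """Return the jstack stanza whose nid matches `tid`, or None.

    Assumes thread stanzas in the dump are separated by blank lines.
    """
    nid = tid_to_nid(tid)
    pattern = re.compile(r"\bnid=" + re.escape(nid) + r"\b")
    for stanza in re.split(r"\n\s*\n", jstack_text):
        if pattern.search(stanza):
            return stanza.strip()
    return None
```

A possible use, with an illustrative two-thread dump: `find_thread(dump_text, 8629)` looks for `nid=0x21b5` and returns that thread's stack, which can then be compared against the hottest threads reported by `top -H`.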


(Shay Banon) #2

Thanks for the stack trace. From what I can see, there are several ongoing stats requests on that node, and I can see how this might happen if a shard is being closed, or was closed, while they were executing (I can't tell from the stack trace whether that's the case). I fixed it here: https://github.com/elasticsearch/elasticsearch/issues/1772. Otherwise, the only other thing I can think of is that the networking lib is causing it (there were some bugs related to that in older versions, though they have been fixed, and it's not evident from the stack trace that this is happening).



(jacque74) #3

Shay, is there a stable version of ES I can get with this fix?
Otherwise, where can I get it?

-jack



(Shay Banon) #4

The fix has been applied to both the master and 0.19 branches. It will be part of the 0.19.1 release (due either this week or next). You can easily build a version yourself to test it: clone or download the 0.19 branch from GitHub and run "mvn package -DskipTests"; the distribution files will be under target/release.
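
The build steps above can be sketched as a small script. This is only an illustration under stated assumptions: the repository URL is inferred from the issue link earlier in the thread, the working directory name is arbitrary, and `build_commands` and `run` are hypothetical helpers. It defaults to a dry run that prints the commands without executing them.

```python
import subprocess

# Assumption: repository URL inferred from the issue link in this thread.
REPO_URL = "https://github.com/elasticsearch/elasticsearch.git"
BRANCH = "0.19"  # branch named in the post above


def build_commands(repo_url=REPO_URL, branch=BRANCH, workdir="elasticsearch-0.19"):
    """Return (argv, cwd) pairs: clone the branch, then package without tests."""
    return [
        (["git", "clone", "--branch", branch, "--depth", "1", repo_url, workdir], None),
        (["mvn", "package", "-DskipTests"], workdir),
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv, cwd in commands:
        prefix = "(in %s) " % cwd if cwd else ""
        print(prefix + " ".join(argv))
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)
```

Calling `run(build_commands())` only prints the two commands; `run(build_commands(), dry_run=False)` would actually clone and build, after which the distribution files should be under target/release inside the working directory, as described in the post.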


