I have an ES 0.90.11 cluster with three nodes (d0, d1, d2), each with 4
cores and 7GB of memory, running Ubuntu and JDK 7u45. The ES instances are
all master+data, configured with a 3.5GB heap. They are pretty much running
a vanilla configuration. Logstash is currently storing on average 200 logs
per second to the cluster, and we use Kibana as a frontend.
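For completeness, the heap is the only thing we set explicitly; assuming
the stock 0.90 scripts (bin/elasticsearch.in.sh picks this up), it is just
the usual environment variable:

    # Environment for the ES service; read by bin/elasticsearch.in.sh
    export ES_HEAP_SIZE=3500m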
Usually, when the cluster is started, the nodes run at around 20% CPU.
However, after some time, one or more of the nodes will jump up to around
90-100% CPU and stay there for what appears to be forever (until I tire of
waiting and restart them).
Using "top -H" I can see that there is one thread in each elasticsearch
process that is using most of the CPU. Here are examples from two of the
nodes:
Node d1:
  PID  USER     PR  NI  VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
41969  elastic  20   0  5814m  3.5g  11m  R  82.8  52.0    1036:30  java
45601  elastic  20   0  5814m  3.5g  11m  S  31.9  52.0   23:02.45  java
41965  elastic  20   0  5814m  3.5g  11m  S  19.1  52.0   25:25.97  java
41966  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:25.95  java
41967  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:23.10  java
41968  elastic  20   0  5814m  3.5g  11m  S  12.7  52.0   25:23.27  java
45810  elastic  20   0  5814m  3.5g  11m  S   6.4  52.0   22:59.55  java
Node d2:
  PID  USER     PR  NI  VIRT   RES   SHR  S  %CPU  %MEM      TIME+  COMMAND
40604  elastic  20   0  5812m  3.6g  11m  R  99.9  53.2  926:23.96  java
41487  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2    4:35.11  java
42443  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   47:03.65  java
42446  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   47:05.12  java
42447  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2   46:38.30  java
31827  elastic  20   0  5812m  3.6g  11m  S   6.5  53.2    0:00.59  java
As you can see, there is one thread in each process that seems to be
running amok.
I have tried the _nodes/hot_threads API to see which thread is using the
CPU, but I can't identify any single thread with the CPU percentage that
top reports. I have also tried dumping the threads with jstack, but the
stack dump doesn't even list the thread PID that top shows (the commands I
used are sketched below).
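Roughly what I ran, for reference. I have since read that jstack prints
native thread IDs (nid) in hexadecimal rather than the decimal PID that
top shows, so perhaps converting the thread PID to hex first is the way to
match them up (I haven't verified that this explains what I'm seeing):

    # Hot threads across the cluster (parameters as documented for 0.90)
    curl -s 'localhost:9200/_nodes/hot_threads?threads=10&interval=500ms'

    # Convert the busy thread PID from "top -H" to hex...
    printf 'nid=0x%x\n' 41969    # -> nid=0xa3f1

    # ...then look for that nid in a dump of the main java process
    # ($ES_PID stands for the PID of the elasticsearch java process)
    jstack $ES_PID | grep -A 20 'nid=0xa3f1'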
Here are a couple of charts showing the CPU user percentage:
https://lh5.googleusercontent.com/-Clcdm5Zh5Ps/Uw4YZLI6BmI/AAAAAAAAEVE/eYINhJP3ACo/s1600/Image.png
As you can see, all the nodes went from 20% to 100% at around 3 PM. At
midnight I got tired of waiting and restarted ES, one node at a time.
The next chart is from some hours later:
https://lh6.googleusercontent.com/-j5Fb3d-GxHU/Uw4YdvfNbaI/AAAAAAAAEVM/3c1g-ztRA18/s1600/Image.png
In this case the nodes' CPU usage increased at different points in time.
CPU iowait remained low (5-10%) the whole time.
I suspect this behavior is triggered by large queries, but I don't have a
specific test case that reproduces it (see the slow-log sketch below).
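To try to catch such queries, my plan is to turn on the search slow log
with low thresholds and watch what gets logged when the CPU jumps. A
sketch of what I have in mind, assuming these thresholds can be updated
dynamically in 0.90 (otherwise they would go into elasticsearch.yml); the
threshold values are just guesses:

    # Log query/fetch phases slower than 1s as WARN, on all indices
    curl -XPUT 'localhost:9200/_settings' -d '{
      "index.search.slowlog.threshold.query.warn": "1s",
      "index.search.slowlog.threshold.fetch.warn": "1s"
    }'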
So, what can I do to find out what is going on? Any help would be greatly
appreciated!
Regards,
Magnus Hyllander