Single thread with high CPU usage


(Magnus Hyllander) #1

I have an ES 0.90.11 cluster with three nodes (d0, d1, d2), with 4 cores
and 7GB memory, running Ubuntu and JDK 7u45. The ES instances are all
master+data, configured with 3.5GB heap size. They are pretty much running
a vanilla configuration. Logstash is currently storing on average 200 logs
per second to the cluster, and we use Kibana as a frontend. Usually when
the cluster is started, the nodes run at around 20% CPU. However, after some
time, one or more of the nodes jumps to around 90-100% CPU and stays
there for what appears to be forever (until I tire and restart them).

Using "top -H" I can see that there is one thread in each Elasticsearch
process that is using most of the CPU. Here are examples from two of the
nodes:

Node d1:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
41969 elastic 20 0 5814m 3.5g 11m R 82.8 52.0 1036:30 java
45601 elastic 20 0 5814m 3.5g 11m S 31.9 52.0 23:02.45 java
41965 elastic 20 0 5814m 3.5g 11m S 19.1 52.0 25:25.97 java
41966 elastic 20 0 5814m 3.5g 11m S 12.7 52.0 25:25.95 java
41967 elastic 20 0 5814m 3.5g 11m S 12.7 52.0 25:23.10 java
41968 elastic 20 0 5814m 3.5g 11m S 12.7 52.0 25:23.27 java
45810 elastic 20 0 5814m 3.5g 11m S 6.4 52.0 22:59.55 java

Node d2:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
40604 elastic 20 0 5812m 3.6g 11m R 99.9 53.2 926:23.96 java
41487 elastic 20 0 5812m 3.6g 11m S 6.5 53.2 4:35.11 java
42443 elastic 20 0 5812m 3.6g 11m S 6.5 53.2 47:03.65 java
42446 elastic 20 0 5812m 3.6g 11m S 6.5 53.2 47:05.12 java
42447 elastic 20 0 5812m 3.6g 11m S 6.5 53.2 46:38.30 java
31827 elastic 20 0 5812m 3.6g 11m S 6.5 53.2 0:00.59 java

As you can see there is one thread in each process that seems to be
running amok.

I have tried the _nodes/hot_threads API to see which thread is using
the CPU, but I can't identify any single thread with the CPU
percentage that top reports. I have also tried using jstack to dump
the threads, but the stack dump doesn't even list a thread matching the
thread PID from top.
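(For reference, one way to map a busy thread from "top -H" to a jstack entry, assuming a Linux JVM: top reports the thread id in decimal, while jstack reports it as the hexadecimal "nid" field, so it has to be converted first. A minimal sketch, using the hot thread id from node d1 above:)

```shell
# "top -H" reports per-thread PIDs in decimal; jstack reports the same
# id as the hexadecimal "nid" field, so convert before searching the dump.
TID=41969                      # busy thread PID from "top -H"
NID=$(printf '0x%x' "$TID")    # jstack prints nids as 0x-prefixed hex
echo "$NID"                    # -> 0xa3f1
# Then find the matching thread in a stack dump of the ES process:
#   jstack <es-pid> | grep -A 20 "nid=$NID"
```

Note that jstack must be run against the JVM process id, not the thread id, and as the same user the JVM runs as.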

Here are a couple of charts showing the cpu user percentage:

https://lh5.googleusercontent.com/-Clcdm5Zh5Ps/Uw4YZLI6BmI/AAAAAAAAEVE/eYINhJP3ACo/s1600/Image.png

As you can see all the nodes went from 20% to 100% at around 3 PM. At
midnight I got tired of waiting and restarted ES, one node at a time.

The next chart is from some hours later:

https://lh6.googleusercontent.com/-j5Fb3d-GxHU/Uw4YdvfNbaI/AAAAAAAAEVM/3c1g-ztRA18/s1600/Image.png

In this case the nodes' cpu usage increased at different points in time.

CPU iowait remains low (5-10%) the whole time.

I'm thinking that maybe this behavior is triggered by large queries, but I
don't have a specific test case that triggers it.

So, what can I do to find out what is going on? Any help would be greatly
appreciated!

Regards,
Magnus Hyllander

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6af1c79d-8402-4de6-9ec2-07893c6b54f2%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Nik Everett) #2

Check how much GC you are doing when it spikes. If it is high, try
clearing the caches:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-clearcache.html
I'd clear each cache one at a time to see which one helps. If that
is the problem, you can configure Elasticsearch to limit the size of those
caches to some percentage of the heap.
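(A sketch of what that could look like on the 0.90 line, assuming a node listening on localhost:9200; the query parameters select one cache type per call, so clear them one at a time and watch CPU and heap after each. Check the linked docs for the exact parameter names in your version:)

```shell
# Check GC activity first (jstat ships with the JDK; <es-pid> is the
# Elasticsearch process id, sampled every 1000 ms):
#   jstat -gcutil <es-pid> 1000
# Then clear one cache at a time to see which one is responsible:
curl -XPOST 'http://localhost:9200/_cache/clear?filter=true'
curl -XPOST 'http://localhost:9200/_cache/clear?field_data=true'
curl -XPOST 'http://localhost:9200/_cache/clear?id=true'
```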

Nik



(Magnus Hyllander) #3

Hi Nik,

Thanks, that was spot on! Clearing the field cache immediately brought
CPU usage back down to normal. If I understand the documentation correctly,
the field cache size is unbounded by default, so I'll look into setting a
limit. It seems you can't set the field cache size limit using the cluster
settings API, so I guess I'll have to set it in the configuration file and
restart the ES instances.
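(For anyone finding this later, a sketch of the relevant elasticsearch.yml settings on the 0.90 line; the setting names have changed across ES versions, so treat these as an assumption and check the docs for your version:)

```yaml
# elasticsearch.yml -- bound the field data cache, which is unbounded
# by default in 0.90. Requires a node restart to take effect.
indices.fielddata.cache.size: 30%      # cap at roughly 30% of heap
indices.fielddata.cache.expire: 10m    # optionally expire idle entries
```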

Thanks again!

Magnus



(system) #4