CPU usage increase after running for a while (CacheRecycler?)

Moreover, I thought about this over the weekend. I understand that
faceting can be CPU heavy, but the CPU usage increases over time, not with
the query rate. That leads me to believe there is some kind of contention
or buildup somewhere...

I'll add logging and monitor through the day...

On Friday, February 22, 2013 2:26:23 PM UTC-5, Jérôme Gagnon wrote:

So would reducing the number of threads help? There are ~3 shards per
node, so if you say that the thread pool size is per Lucene shard
instance, does that mean 48 search threads?

I know that facets can be CPU heavy, but I'm not doing a heavy number of
queries per second either. I'm not sure what to try next.

For the record, I am using Java 7u11, which is pretty much the latest.

On Friday, February 22, 2013 1:09:05 PM UTC-5, kimchy wrote:

The number of pushes can't really be more than the number of pulls,
assuming you have a bounded number of concurrent requests (like a bounded
search thread pool). Both ConcurrentLinkedQueue and LinkedTransferQueue are
non-blocking, which means they retry on "failure" (à la compareAndSet).
Note, Jorg's point about size is not relevant; we don't call size().

What you see is really strange… Can you try with a newer Java version?
If you still suspect the ConcurrentLinkedQueue, you can try using a
LinkedBlockingQueue instead and see if it helps.

Do you see the high CPU usage while driving the concurrent search load?
It really might just be OK to see high CPU usage, and it may have nothing
to do with the queue itself. 16 concurrent shard-level queries are allowed
to execute; if you push enough load to keep the pool full, I suspect you
will see very high CPU usage (facets are typically very CPU intensive).

On Feb 22, 2013, at 6:15 PM, Jérôme Gagnon jerome....@gmail.com wrote:

We took another look at the CacheRecycler code and we saw that the poll
method actually removes the head.

Since our profiling now points to offer() for CPU time, we took a look at
the code of ConcurrentLinkedQueue and noticed that it tries to append at
the tail, and if that fails (the tail was changed by another thread) it
scans the list to find the new tail. So our conclusion is that either the
queue is very large or the number of concurrent accesses is so high that it
keeps rescanning the queue. If the second hypothesis (concurrent access)
were the problem, we should see the high CPU as soon as we start the
server, which is not the case. So now we think that the number of pushes is
greater than the number of polls, which would cause the queue to grow over
time. Could that be the case? As a workaround, we are thinking about
capping the queue size.
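
To make the capping idea concrete, here is a minimal sketch of a size-capped recycler pool, assuming a bounded LinkedBlockingQueue along the lines kimchy suggests above; the class and constant names are made up and this is not the actual CacheRecycler code:

import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical, size-capped recycler pool; not the actual CacheRecycler code.
class BoundedIntArrayRecycler {

    private static final int MAX_POOLED_ARRAYS = 1024; // cap chosen purely for illustration
    private final Queue<int[]> pool = new LinkedBlockingQueue<int[]>(MAX_POOLED_ARRAYS);

    int[] pop(int size) {
        int[] buf = pool.poll(); // non-blocking: returns null when the pool is empty
        if (buf != null && buf.length >= size) {
            return buf;
        }
        return new int[size]; // fall back to a fresh allocation
    }

    void push(int[] buf) {
        // offer() returns false once the cap is reached, so excess arrays are
        // simply left for the garbage collector instead of growing the queue.
        pool.offer(buf);
    }
}

The trade-off is extra allocation when the cap is hit versus unbounded queue growth; whether dropping arrays on overflow is acceptable depends on how hot the recycler is.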

What do you think?

On Friday, February 22, 2013 11:50:21 AM UTC-5, Jérôme Gagnon wrote:

We looked at the CacheRecycler code here, and we are not sure how the
queue size could ever decrease, because popIntArray only calls poll(). So
our theory is that the queue is not bounded and the process loops
indefinitely in it when trying to add an int array. Moreover, there seems
to be a correlation between the CPU usage and the number of field evictions
on our servers... so that could potentially be linked. Maybe you can help
us with that?

We are also not yet sure when those methods (pushIntArray and
popIntArray) are called. We added some logging to check the size of the
queue, but we need to give it time before we see something in the logs...
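
One thing to watch with that logging: ConcurrentLinkedQueue.size() is O(n) (it traverses the whole queue), so calling it on a large, heavily contended queue can itself burn CPU and skew the measurement. A minimal sketch of a cheaper alternative, keeping an approximate counter next to the queue (hypothetical wrapper, not ES code):

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper used only for size logging; not Elasticsearch code.
class CountedQueue<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<T>();
    private final AtomicInteger approxSize = new AtomicInteger();

    void push(T item) {
        queue.offer(item);
        approxSize.incrementAndGet();
    }

    T pop() {
        T item = queue.poll();
        if (item != null) {
            approxSize.decrementAndGet();
        }
        return item;
    }

    // Cheap to read from a logging thread; ConcurrentLinkedQueue.size()
    // would traverse the whole queue on every call.
    int approximateSize() {
        return approxSize.get();
    }
}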

For the information you asked for:

Bigdesk shows ~60 QPS.
Thread pool settings:
threadpool:
  search:
    type: fixed
    size: 16
    min: 1
    queue_size: 64
    reject_policy: abort

Our machines have 8 cores (2 quad-core CPUs).

Thank you,

Jerome

On Friday, February 22, 2013 11:02:37 AM UTC-5, kimchy wrote:

How many concurrent requests are you executing? What are the search
thread pool settings? How many cores do you have?

On Feb 22, 2013, at 4:14 PM, Jérôme Gagnon jerome....@gmail.com
wrote:

Most of the queries are doing facets on a low-cardinality int field
(5-6 different possible values) with a facet filter. I am not sure it's
supposed to use that much CPU.
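
For reference, a sketch of roughly the shape of such a query with the 0.20-era Java client; the index, field, and filter names here are invented, and the exact builder methods may differ slightly between versions:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.facet.FacetBuilders;

// Hypothetical example: a terms facet on a low-cardinality int field,
// restricted by a facet filter. Index, field, and filter values are made up.
public class FacetQueryExample {
    public static SearchResponse facetedSearch(Client client) {
        return client.prepareSearch("my_index")
                .setQuery(QueryBuilders.matchAllQuery())
                .addFacet(FacetBuilders.termsFacet("status_counts")
                        .field("status") // int field with ~5-6 distinct values
                        .facetFilter(FilterBuilders.termFilter("customer_id", 42)))
                .execute()
                .actionGet();
    }
}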

Moreover, there still seems to be contention somewhere; all my CPUs have
gone up to 100% and query time is still increasing.

On Friday, February 22, 2013 9:48:03 AM UTC-5, kimchy wrote:

This might just be the CPU needed to compute the terms facet… It might
be that the sampling done to get the hot threads just ends up landing on
the addition to the queue...

On Feb 22, 2013, at 3:44 PM, Jérôme Gagnon jerome....@gmail.com
wrote:

Early update... the contention seems to have moved from
LinkedTransferQueue to ConcurrentLinkedQueue (no surprise here).

I am still seeing high CPU usage though; the cluster is still running,
more updates to come.

gist:5013915 · GitHub

On Thursday, February 21, 2013 4:45:31 PM UTC-5, Jérôme Gagnon wrote:

By the way, thanks for the help, it's really appreciated!

On Thursday, February 21, 2013 3:32:50 PM UTC-5, Jérôme Gagnon wrote:

First, I'm pretty sure there is no heap pressure; heap usage is between
10 and 12 GB out of 15 GB total, and the GC pattern is clean.

And I'm forking it right now; I will let you know...

On Thursday, February 21, 2013 3:13:05 PM UTC-5, kimchy wrote:

This is really strange… I suggest two things: first, are you sure
there is no memory pressure heap-wise? Second, I pushed an updated version
of those concurrent collections to the 0.20 branch; maybe you can give it
a go?

On Feb 21, 2013, at 8:04 PM, Jérôme Gagnon jerome....@gmail.com
wrote:

And for the whole cluster...

gist:5007186 · GitHub

On Thursday, February 21, 2013 1:54:48 PM UTC-5, Jérôme Gagnon
wrote:

Sure, it's happening right now as a matter of fact...

ES cpu hot threads · GitHub

On Thursday, February 21, 2013 1:49:19 PM UTC-5, kimchy wrote:

Can you issue hot threads when you see the increased CPU usage
and gist it?

On Feb 21, 2013, at 5:32 PM, Jérôme Gagnon jerome....@gmail.com
wrote:

Edit 2: We also reduced the search thread_pool size, since we think that
all the threads are trying to call the method posted up there, and with a
blocking call there is some kind of contention; over time the xfer and
append CPU time keeps increasing.

We are running the latest ES version (0.20.5) with Java 7u11.

On Thursday, February 21, 2013 11:18:43 AM UTC-5, Jérôme Gagnon
wrote:

Edit: We believe this is not related to GC, since GC CPU usage is
2-5% on all the nodes and heap usage stays between 50-75%.

On Thursday, February 21, 2013 11:17:02 AM UTC-5, Jérôme Gagnon
wrote:

Hi everyone,

We are currently trying to put our ES cluster in production, and we
are having some CPU usage issues after some uptime. When we start,
everything runs fine, but after a while we experience an increase in
CPU usage.

Here is a screenshot of the CPU usage from last night while we
experienced the issue (in the middle):
http://dl.dropbox.com/u/317367/cpu-day.png

We added a cache expiry time and found that it helped. (We are still
running with this, but the CPU is still increasing.)

After doing some CPU profiling, we found that CacheRecycler is one of
our issues, specifically at:

at org.elasticsearch.common.util.concurrent.jsr166y.LinkedTransferQueue.xfer(LinkedTransferQueue.java:606)
at org.elasticsearch.common.util.concurrent.jsr166y.LinkedTransferQueue.add(LinkedTransferQueue.java:1049)
at org.elasticsearch.common.CacheRecycler.pushIntArray(CacheRecycler.java:470)
at org.elasticsearch.common.CacheRecycler.pushIntArray(CacheRecycler.java:460)

I attached the thread dumps to this post; es7b is having the issue, while es1b is not.

Thanks for any help.


Some update about the issue: I still haven't found the reason why, but I'm
still getting CPU spikes causing high latency and an unrecoverable cluster.
The cluster was stable (with some spikes, but I suspect merging) from 8:00
to 18:30 yesterday, then suddenly the CPU spiked, latency went up, and the
cluster did not recover; in the end I had to remove the traffic. This is
with version 0.20.6 with ConcurrentLinkedQueue, which seems to have slowed
the buildup a bit, but does not eliminate it at all. I'm going to attach
some graphs and the hot threads dump of the cluster. I still hope to get
answers on this one, as it prevents me from going to production with my
cluster...

Thank you,

Jerome
