Hi!
We've been load-testing Elastic Search before we release to production
(and we love it!) and we've been noticing periodic CPU spikes:
The only APIs we use are the Index and the Search API. We used to have
long-request problems because we'd send Search requests with 1000+
facets, but we rewrote that code recently to use one facet (histogram)
and it's been smooth sailing since then. However, the CPU spikes persist,
and every now and then one of our calls to ES will time out. (We use
NEST, the C# client, to talk to ES.)
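For reference, a single histogram facet over the 0.x Search API looks roughly like the sketch below; the index name, field name, and interval are made up for illustration (NEST builds the same JSON under the hood):

  curl -XGET 'http://localhost:9200/myindex/_search?pretty=true' -d '{
    "query": { "match_all": {} },
    "facets": {
      "my_histogram": {
        "histogram": { "field": "duration_ms", "interval": 100 }
      }
    }
  }'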
We're running ES on an eight-gigabyte, four-core virtual machine. Here
are our node stats:
It's using the standard configuration (1 node, 1 replica, 10 shards
total) that comes out of the box, and we're running ES via the service
script and conf file that came with ES. What do you guys think we
should do to make these spikes go away? Should we be running more
nodes (if so, a node on a different box or a node on the same box)?
Any insight into this problem would be super-appreciated.
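For reference, the "10 shards total" comes from the out-of-the-box index defaults of that era (implicit unless overridden in config/elasticsearch.yml):

  index.number_of_shards: 5
  index.number_of_replicas: 1

That is 5 primaries, each with 1 replica, for 10 shards total; on a single node the 5 replica copies simply stay unassigned.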
And I don't know if this is relevant at all, but the only way I've
been able to simulate a spike is by POSTing to _cache/clear. (That's
why you see three spikes back-to-back in that graph toward the end;
that's me, poking around, sending cache-clear POSTs.)
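(For anyone trying to reproduce the spike, the cache-clear call above is just a plain POST; the host and port here are the defaults and may differ in your setup:

  curl -XPOST 'http://localhost:9200/_cache/clear'

This drops the caches, such as the filter cache, across all indices, so the next queries have to rebuild them.)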
You sent the node info and not the node stats; can you send the node stats as well? From the node info, I can already see some things that you should do:
Set the min and max memory to be the same, and set them to ~4GB of the machine's total 8GB.
Use a more recent JVM, the one you are using is old.
Are you running on Windows or *nix? If *nix, enable mlockall (so the Java process will not swap).
Install BigDesk and see what happens when the CPU spikes: does the GC time / count increase around the same interval?
It might be internal merges going on; you can tell whether merges are happening, and how long they took, using the node stats API (hopefully we will have this graphed in BigDesk soon). (A rough sketch of these settings and checks follows after this list.)
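A minimal sketch of the suggestions above, assuming the stock service script honors the usual ES_MIN_MEM / ES_MAX_MEM environment variables and the standard config/elasticsearch.yml location (the 4g figure and the host/port are illustrative):

  # Pin the heap so min == max, at roughly half of the 8 GB box:
  export ES_MIN_MEM=4g
  export ES_MAX_MEM=4g

  # In config/elasticsearch.yml (on *nix), keep the heap from being swapped out:
  #   bootstrap.mlockall: true

  # When a spike hits, pull node stats and look at the GC counts/times and the
  # merge totals. On 0.x the endpoint is /_cluster/nodes/stats; newer releases
  # use /_nodes/stats, and some old versions need the extra sections requested
  # explicitly (e.g. ?jvm=true&indices=true).
  curl 'http://localhost:9200/_cluster/nodes/stats?pretty=true'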
On Thursday, February 2, 2012 at 1:43 AM, Hao Lian wrote:
Hi!
We've been load-testing Elastic Search before we release to production
(and we love it!) and we've been noticing periodic CPU spikes:
The only APIs we use are the Index and the Search API. We used to have
long-request problems because we'd send Search requests with 1000+
facets, but we rewrote that code recently to use one facet (histogram)
and it's been smooth sailing since then. However, the CPU spikes persist,
and every now and then one of our calls to ES will time out. (We use
NEST, the C# client, to talk to ES.)
We're running ES on an eight-gigabyte, four-core virtual machine. Here
are our node stats:
It's using the standard configuration (1 node, 1 replica, 10 shards
total) that comes out of the box, and we're running ES via the service
script and conf file that came with ES. What do you guys think we
should do to make these spikes go away? Should we be running more
nodes (if so, a node on a different box or a node on the same box)?
Any insight into this problem would be super-appreciated.
And I don't know if this is relevant at all, but the only way I've
been able to simulate a spike is by POSTing to _cache/clear. (That's
why you see three spikes back-to-back in that graph toward the end;
that's me, poking around, sending cache-clear POSTs.)
Not sure which ES version you are using, so just a note about BigDesk: it
does not support the 0.19.0 snapshot version of ES very well. There are a lot of
breaking changes in ES master, but I am working on a fix as we speak...
On Thu, Feb 2, 2012 at 8:43 PM, Hao Lian hao@fogcreek.com wrote:
Thanks Otis and Shay, we'll get right on following those leads.
OK, we think we fixed it, and I just wanted to report a little about
what we did. We couldn't upgrade to JVM 7 because it would've been too
big a drain on our sysadmin resources, but we did solve the problem.
Setting ES_MIN_MEM = ES_MAX_MEM helped substantially in that it
made it much easier to see what was going on with the Java GC.
BigDesk is awesome! Way better than just running curl over and over.
jstat is awesome!
It turns out that ES with SurvivorRatio=8 and UseParNewGC would
slowly fill up the eden space until the GC paused and moved the surviving
objects to S0 or S1. This was causing all our CPU spikes, because our eden
space was about 2 GB given that we have a 6 GB heap and the default new:old
ratio is 1:2. Once we realized this, we bounded the new generation (and
therefore eden) with NewSize and MaxNewSize at 320 MB (down from ~2 GB).
Now GC happens more frequently but each collection is faster, saturating
at most one of our cores rather than all four. We've got more tuning to do,
but for now ES remains responsive during these smaller but more frequent
collections, which makes us happy.
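Roughly how this can be wired up, assuming the service script picks up ES_JAVA_OPTS alongside ES_MIN_MEM/ES_MAX_MEM (the exact plumbing may differ per install; the 6g and 320m values are just the ones mentioned above):

  # Pin the heap so the GC behaviour is easy to read:
  export ES_MIN_MEM=6g
  export ES_MAX_MEM=6g

  # Cap the new generation so young collections stay small and short:
  export ES_JAVA_OPTS="-XX:NewSize=320m -XX:MaxNewSize=320m"

  # Watch eden (E), survivors (S0/S1), old gen (O) and GC times, sampling
  # once a second; <pid> is the Elasticsearch java process id:
  jstat -gcutil <pid> 1000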
Thanks for everybody's help!
Hao Lian
Fog Creek Software
You don't have to move to Java 7, but can you try and run with a newer Java 6 version? u18 is very old.
Where did you see that the value of the new generation was 2gb? Can you gist it?
On Friday, February 3, 2012 at 7:21 PM, Hao Lian wrote:
OK, we think we fixed it, and I just wanted to report a little about
what we did. We couldn't upgrade to JVM 7 because it would've been too
big a drain on our sysadmin resources, but we did solve the problem.
Setting ES_MIN_MEM = ES_MAX_MEM helped substantially in that it
made it much easier to see what was going on with the Java GC.
BigDesk is awesome! Way better than just running curl over and over.
jstat is awesome!
It turns out that ES with SurvivorRatio=8 and UseParNewGC would
slowly fill up the eden space until the GC paused and moved the surviving
objects to S0 or S1. This was causing all our CPU spikes, because our eden
space was about 2 GB given that we have a 6 GB heap and the default new:old
ratio is 1:2. Once we realized this, we bounded the new generation (and
therefore eden) with NewSize and MaxNewSize at 320 MB (down from ~2 GB).
Now GC happens more frequently but each collection is faster, saturating
at most one of our cores rather than all four. We've got more tuning to do,
but for now ES remains responsive during these smaller but more frequent
collections, which makes us happy.
Thanks for everybody's help!
Hao Lian
Fog Creek Software
You don't have to move to Java 7, but can you try and run with a newer Java 6 version? u18 is very old.
Good idea! I'll ask our sysadmins.
Where did you see that the value of the new generation was 2gb?
We calculated this purely from the JVM docs. (If there's a tool to measure
this for us, we'd love to have it.) I'm redoing the calculations and
getting different values now, so I think it's entirely possible we
made an error in the heat of the moment. But our impetus for
decreasing it was that we were seeing eden climb steadily in
jstat until it became full, at which point a move to the survivor
spaces occurred and spiked the CPU.
I added support for getting the memory pool sizes when fetching the JVM stats via the node stats API, so it will be simpler to know what value each has (including eden/new).
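Once that is in a build, checking the pool sizes should be a curl one-liner along these lines; the exact URL depends on the ES version (0.x exposes node stats under /_cluster/nodes/stats, newer releases under /_nodes/stats/jvm), so treat this as a sketch:

  curl 'http://localhost:9200/_cluster/nodes/stats?jvm=true&pretty=true'

and read the eden / survivor / old pool entries out of the jvm.mem section of the response.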
On Monday, February 6, 2012 at 6:15 PM, Hao Lian wrote:
You don't have to move to Java 7, but can you try and run with a newer Java 6 version? u18 is very old.
Good idea! I'll ask our sysadmins.
Where did you see that the value of the new generation was 2gb?
We calculated this purely from the JVM docs. (If there's a tool to measure
this for us, we'd love to have it.) I'm redoing the calculations and
getting different values now, so I think it's entirely possible we
made an error in the heat of the moment. But our impetus for
decreasing it was that we were seeing eden climb steadily in
jstat until it became full, at which point a move to the survivor
spaces occurred and spiked the CPU.