Periodic CPU spikes

Hi!

We've been load-testing Elasticsearch before we release to production
(and we love it!), and we've been noticing periodic CPU spikes:

http://i.imgur.com/PIbNb.png

We have two indices, both with the same mapping:

https://raw.github.com/gist/f959b5f1502e12e1dc2b/c95c208ff9c6232de5846ea6abd4b9f96c8c6570/gistfile1.txt

The only APIs we use are the Index and Search APIs. We used to have
long-request problems because we'd send Search requests with 1,000+
facets, but we recently rewrote that code to use a single histogram
facet and it's been smooth sailing since then. However, the CPU spikes
persist, and every now and then one of our calls to ES times out. (We
use NEST, the C# client, to talk to ES.)
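
For reference, the query now boils down to a single histogram facet,
roughly like this (index name, field name, and interval are made up
here for illustration):

  curl -XPOST 'http://localhost:9200/some-index/_search?pretty=true' -d '{
    "query": { "match_all": {} },
    "facets": {
      "histo": {
        "histogram": { "field": "some_numeric_field", "interval": 100 }
      }
    }
  }'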

We're running ES on an eight-gigabyte, four-core virtual machine. Here
are our node stats:

https://raw.github.com/gist/f0246240e0285ca16bfb/d1090e3cbaf7347b55f5e5356bf61caff072ba56/gistfile1.txt

and our node health:

https://raw.github.com/gist/b4edaa73c14be8829919/431138ed026b576a2ea7362b879c75dde337576c/gistfile1.txt

It's using the standard out-of-the-box configuration (1 node, 5 shards
with 1 replica each, 10 shards total), and we're running ES via the
service script and conf file that came with ES. What do you guys think
we should do to make these spikes go away? Should we be running more
nodes (and if so, on a different box or on the same box)? Any insight
into this problem would be super-appreciated.

And I don't know if this is relevant at all, but the only way I've
been able to simulate a spike is by POSTing to _cache/clear. (That's
why you see three spikes back-to-back in that graph toward the end;
that's me, poking around, sending cache-clear POSTs.)
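
In case it helps anyone reproduce it, the cache-clear call is just
this (assuming the default port):

  curl -XPOST 'http://localhost:9200/_cache/clear'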

Thanks!
Hao Lian
Fog Creek Software

Hao,

This could be JVM GC in action. Do you have something to monitor the
JVMs? I'd point you to our Sematext performance monitoring service,
but it currently supports only Solr, not Elasticsearch yet. Try using
jstat; even that will tell you whether it's GC.
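
Something along these lines should be enough to see it (where <pid> is
the Elasticsearch JVM's process id, e.g. from jps):

  jps -l                    # find the Elasticsearch JVM's pid
  jstat -gcutil <pid> 1000  # GC counts/times and generation usage, sampled every second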

Otis

On Feb 1, 6:43 pm, Hao Lian h...@fogcreek.com wrote:

Heya,

You sent the node info and not the node stats; can you send the node stats as well? From the node info alone, I can already see some things you should do:

  1. Set the min and max memory to be the same, and set them to ~4 GB of the machine's 8 GB total (a sketch of the relevant settings follows this list).
  2. Use a more recent JVM; the one you are using is old.
  3. Are you running on Windows or *nix? If *nix, enable mlockall (so the Java process will not swap).
  4. Install bigdesk and see what happens when the CPU spikes: do the GC time / count increase around the same interval?
  5. It might be internal merges going on. You can tell whether merges are happening, and how long they took, using the node stats API (hopefully we will have this graphed in bigdesk soon).
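
A rough sketch of items 1, 3, and 5, assuming the stock
elasticsearch.in.sh and config/elasticsearch.yml that ship with ES:

  # elasticsearch.in.sh (or wherever your service script sets the heap)
  ES_MIN_MEM=4g
  ES_MAX_MEM=4g

  # config/elasticsearch.yml
  bootstrap.mlockall: true

  # the user running ES also needs permission to lock memory, e.g.
  ulimit -l unlimited

  # merge activity shows up under the node stats API, e.g.
  curl 'http://localhost:9200/_cluster/nodes/stats?pretty=true'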

On Thursday, February 2, 2012 at 1:43 AM, Hao Lian wrote:

Thanks Otis and Shay, we'll get right on following those leads.

Here are our node stats, sorry about that mixup:

https://raw.github.com/gist/78972f7ed681bfb44a24/b365f854df4a5df427c9352ed6afef225ad3fd7a/gistfile1.txt
(non-spiky load)
https://raw.github.com/gist/ff0a02b9dcff4b84c554/367695337c9113f6e66e735fa81d0985e6858540/gistfile1.txt
(spiky load, pulled from yesterday)

The merges don't look too problematic in either case, but thanks for
the heads up there.

We are running on Debian Linux; we're investigating mlockall as we
speak, in addition to installing bigdesk and jstat.

Thanks again!
Hao.

Not sure which ES version you are using, so just a note about bigdesk:
it does not support the 0.19.0 snapshot ES version very well (there
are a lot of breaking changes in ES master), but I am working on a fix
as we speak...

On Thu, Feb 2, 2012 at 8:43 PM, Hao Lian hao@fogcreek.com wrote:

OK, we think we fixed it, and I just wanted to report a little about
what we did. We couldn't upgrade to JVM 7 because it would've been too
big a drain on our sysadmin resources, but we did solve the problem.

  • Setting ES_MIN_MEM = ES_MAX_MEM helped substantially, in that it
    made it much easier to see what was going on with the Java GC.

  • BigDesk is awesome! Way better than just running curl over and over.

  • jstat is awesome!

  • It turns out that ES with SurvivorRatio=8 and UseParNewGC would
    slowly fill up eden until the GC paused and moved the surviving
    objects to S0 or S1. This was causing all our CPU spikes, because
    our eden space was about 2 GB: we have a 6 GB heap and the default
    old:new ratio is 2:1, so the new generation gets roughly a third
    of the heap. Once we realized this, we bounded the new generation
    (and thus eden) with NewSize and MaxNewSize at 320 MB (down from
    ~2 GB); the flags are sketched after this list. Now GC happens
    more frequently but finishes faster, saturating at most one of our
    cores rather than all four. We've got more tuning to do, but for
    now ES remains responsive during these smaller, more frequent
    collections, which makes us happy.
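
Concretely, the knobs boil down to something like the following sketch
(exactly where these lines live depends on how you launch ES; ours go
through the bundled service script):

  # heap: min = max, as suggested earlier in the thread
  ES_MIN_MEM=6g
  ES_MAX_MEM=6g

  # appended to the JVM options (e.g. via JAVA_OPTS / ES_JAVA_OPTS in
  # elasticsearch.in.sh): bound the new generation (eden + survivors)
  JAVA_OPTS="$JAVA_OPTS -XX:NewSize=320m -XX:MaxNewSize=320m"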

Thanks for everybody's help!
Hao Lian
Fog Creek Software

Heya,

  • You don't have to move to Java 7, but can you try and run with a newer Java 6 version? u18 is very old.
  • Where did you see that the value of the new generation was 2gb? Can you gist it?

On Friday, February 3, 2012 at 7:21 PM, Hao Lian wrote:

You don't have to move to Java 7, but can you try and run with a newer Java 6 version? u18 is very old.

Good idea! I'll ask our sysadmins.

Where did you see that the value of the new generation was 2gb?

We calculated this purely from the JVM docs. (If there's a tool to
measure this for us, we'd love to have it.) I'm redoing the
calculations and getting different values now :smiley:, so it's
entirely possible we made an error in the heat of the moment. But our
impetus for decreasing it was that we were seeing eden climb steadily
in jstat until it became full, at which point a move to the survivor
spaces occurred and spiked the CPU.
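
For anyone else wanting to measure rather than derive it, jstat can
print the actual generation capacities (<pid> being the ES process id):

  jstat -gccapacity <pid>   # NGC / EC columns = current new-gen / eden capacity, in KB
  jstat -gc <pid> 1000      # capacities plus usage, sampled every second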

I added support for getting the memory pool sizes as part of the JVM stats in the node stats API, so it will be simpler to see what value each pool has (including eden/new).
