ElasticSearch hangs/freezes EC2 box

We have ElasticSearch 0.16.2 running as the only thing on an EC2 large
instance. After 8-16 hours the box starts to become unresponsive for
periods: first for maybe 30 seconds at a time, then longer, until it
locks up completely and we have to restart it. During this time, load
averages go up to 40 or so and we're unable to run any commands on the
box, even if we've got an open shell.

We tried changing the GC settings. We actually disabled all the GC
settings and ElasticSearch stayed up much longer, but after 24 hours
or so we got the same lockups. The lockups do NOT appear to be
happening during a full GC. Here is jstat right before and right
after the lockup:

Timestamp     S0     S1      E      O      P    YGC     YGCT  FGC   FGCT      GCT
  66843.9  30.26   0.00  21.18  90.56  50.89   3084  140.545    5  2.449  142.994
  67599.5  30.26   0.00  31.05  90.56  50.90   3084  140.545    5  2.449  142.994

jstat was set to poll every second, but it was not able to do so due
to the lockup, so that is the memory right before and right after a
lockup.
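
To make the comparison explicit, here is a minimal Python sketch that diffs the two jstat rows. It assumes the output came from `jstat -gcutil` (the column names match that layout) and that the Timestamp column is seconds since JVM start; `parse_row` and `compare` are just illustrative helper names.

```python
# Column names as they appear in the jstat header above.
COLUMNS = ["Timestamp", "S0", "S1", "E", "O", "P",
           "YGC", "YGCT", "FGC", "FGCT", "GCT"]


def parse_row(line):
    """Turn one whitespace-separated jstat row into a dict of floats."""
    return dict(zip(COLUMNS, (float(v) for v in line.split())))


def compare(before, after):
    """Report the time gap and the GC activity between two samples."""
    return {
        "gap_seconds": round(after["Timestamp"] - before["Timestamp"], 1),
        "young_gcs": int(after["YGC"] - before["YGC"]),
        "full_gcs": int(after["FGC"] - before["FGC"]),
        "gc_seconds": round(after["GCT"] - before["GCT"], 3),
    }


before = parse_row("66843.9 30.26 0.00 21.18 90.56 50.89 3084 140.545 5 2.449 142.994")
after = parse_row("67599.5 30.26 0.00 31.05 90.56 50.90 3084 140.545 5 2.449 142.994")
print(compare(before, after))
```

For these two rows the gap is 755.6 seconds, with no young or full collections and no additional GC time in between, which is why the lockup does not look like a GC pause.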

We do not have mlockall enabled and we have very little data in the
DB.

java -version

java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.8) (6b20-1.9.8-0ubuntu1~10.04.1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)

uname -a

Linux ec2-184-72-76-49.compute-1.amazonaws.com 2.6.32-309-ec2 #18-Ubuntu SMP Mon Oct 18 21:00:50 UTC 2010 x86_64 GNU/Linux

Our DB size on the filesystem is 174 MB. We have ES_MIN/MAX_MEM set to
4GB. A few queries run while this is going on, but none during the
lockup, and traffic is not significant.

We've also noticed that our used memory goes up and up and up and the
amount of reclaimed memory....

For instance, right before we restarted ElasticSearch:
[2011-06-21 15:58:57,288][DEBUG][monitor.jvm ] [Bushwacker] [gc][PS Scavenge][502] took [13ms]/[21.8s], reclaimed [306mb], leaving [312.8mb] used, max [4.1gb]

And right after we started it again:
[2011-06-21 16:30:41,115][DEBUG][monitor.jvm ] [Milan] [gc][PS Scavenge][3] took [127ms]/[247ms], reclaimed [243.4mb], leaving [71mb] used, max [4.1gb]
[2011-06-21 16:33:38,956][DEBUG][monitor.jvm ] [Milan] [gc][PS Scavenge][4] took [141ms]/[388ms], reclaimed [254.1mb], leaving [74.5mb] used, max [4.1gb]

(these numbers are similar with the default GC settings).
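
As a rough way to put a number on the allocation rate, the sketch below parses two consecutive monitor.jvm lines and estimates how many MB were allocated per second between them. It assumes each log entry has been joined onto a single line, that all sizes are reported in mb, and that the "leaving [...] used" figure is heap occupancy right after the collection; `parse` and `allocation_rate` are illustrative names, not part of ElasticSearch.

```python
import re
from datetime import datetime

# Matches the monitor.jvm GC lines shown above (sizes assumed to be in mb).
LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\].*?\[gc\]\[(?P<collector>[^\]]+)\]\[(?P<count>\d+)\] "
    r"took \[[^\]]+\]/\[[^\]]+\], "
    r"reclaimed \[(?P<reclaimed>[\d.]+)mb\], "
    r"leaving \[(?P<used>[\d.]+)mb\] used"
)


def parse(line):
    """Extract timestamp, collection count and sizes from one GC log line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError("unrecognised GC line: " + line)
    return {
        "time": datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"),
        "count": int(m.group("count")),
        "reclaimed_mb": float(m.group("reclaimed")),
        "used_mb": float(m.group("used")),
    }


def allocation_rate(prev, curr):
    """Approximate MB allocated per second between two consecutive collections."""
    heap_before_gc = curr["used_mb"] + curr["reclaimed_mb"]
    allocated = heap_before_gc - prev["used_mb"]
    seconds = (curr["time"] - prev["time"]).total_seconds()
    return allocated / seconds


third = parse("[2011-06-21 16:30:41,115][DEBUG][monitor.jvm ] [Milan] [gc][PS Scavenge][3] "
              "took [127ms]/[247ms], reclaimed [243.4mb], leaving [71mb] used, max [4.1gb]")
fourth = parse("[2011-06-21 16:33:38,956][DEBUG][monitor.jvm ] [Milan] [gc][PS Scavenge][4] "
               "took [141ms]/[388ms], reclaimed [254.1mb], leaving [74.5mb] used, max [4.1gb]")
print(round(allocation_rate(third, fourth), 2), "MB/s")
```

Under those assumptions, scavenges 3 and 4 are about three minutes apart and the estimate comes out at roughly 1.4 MB per second.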

It really seems like it is leaking memory too and generating a TON of
objects, since it's reclaiming 240 MB every scavenge! Why would it be
doing this?

Any ideas?

Dan

On Jun 21, 2011, at 9:34 AM, Dan Diephouse wrote:

We have Elasticsearch 0.16.2 running as the only thing on an EC2 large
instance. As time goes on (8-16 hours) the box starts to become
unresponsive for periods. First, for maybe 30 seconds at a time, then
longer, then it completely will lock up the box so we have to restart
it. During this time, load averages go up to 40 or so and we're unable
to run any commands on the box, even if we've got an open shell.

Ubuntu 10.04 Lucid has a known issue in its kernel which causes this problem. JVM-based apps seem to trigger it extremely easily. Upgrade your kernel or distribution.