Very long GC

Hi,

We run an 8-instance ES cluster in EC2 with one index (size: 3695.5gb),
48 shards, 1 replica per shard. ES version is 0.20.2, Oracle JVM
1.6.38, 30GB heap, 60GB total RAM, 2 * 1024GB SSD drives in RAID 0
(LVM stripe), 16 cores, no faceting or routing.

We occasionally observe extremely long GC pauses (> 5 minutes), which
result in nodes being taken out of the cluster and then rejoining it
(with consequent shard redistribution).

I am new to Elasticsearch and I am trying to understand how I can
investigate the reasons for the long GC pauses. Looking at the logs,
the bigdesk/paramedic plugins and our Graphite graphs, I cannot seem to
find a correlation between these very long GC pauses and anything else;
most likely I am not looking at the right metrics.

Some ideas floating around are separating master and query nodes from
data-only nodes, moving to Java 7, and perhaps having more, smaller
shards, but first we need to understand what is causing the long GC.

Our JVM settings:

-server
-Djava.net.preferIPv4Stack=true
-Xms30g
-Xmx30g
-Xss256k
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

Any help would be greatly appreciated, thanks

Simone

Long GCs with the CMS collector are caused by "stop the world" phases on
large heaps while your workload is creating a lot of memory pressure.

Java 7 is the current Java release and Java 6 is no longer supported by
Oracle. Note that Java 6 and the CMS collector were not designed
with large heaps in mind.

Have you sized your ES cluster? How did you conclude that 8 nodes are
enough? Have you configured the caches/filter memory?

My recommendations

  • update to Java 7

  • update to ES 0.90+. Lucene 4 gives a dramatic factor of memory savings,
    and the ES 0.90 caches/filters improve memory usage a lot

  • if you still get issues, try to reduce the "stop the world" duration by
    carefully adapting the heap size to your workload, or by lowering the
    memory pressure. You can scale ES over more nodes to achieve this

  • if this still doesn't suffice, you could also consider switching to an
    improved GC algorithm. Maybe the G1 GC is an option for you (see the
    sketch after this list), but I am aware of fatal JVM crashes with trove4j,
    a component of ES; the issue is not caused by ES, but it is very annoying.
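
For reference, a minimal sketch of what swapping CMS for G1 on Java 7 would
look like in the JVM options; treat this as an illustration only, not a tuned
configuration, and the pause-time goal is just an example value:

  # replaces -XX:+UseParNewGC / -XX:+UseConcMarkSweepGC and the CMS occupancy flags
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=200   # soft pause-time goal in ms (example value)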

Jörg

30G is a fairly large heap, but a GC of 5 minutes sounds like too much for a
30G heap. Can you share the GC log files? You can enable GC logging if you
haven't done so already.

Also, try lowering your heap to say 18G and see the effect.

How big is your -Xmn?
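
For example, assuming the stock startup scripts (which read ES_HEAP_SIZE for
-Xms/-Xmx and pass ES_JAVA_OPTS through to the JVM), something like this
before starting the node; 18g and 1g are only illustrative values:

  export ES_HEAP_SIZE=18g          # sets both -Xms and -Xmx
  export ES_JAVA_OPTS="-Xmn1g"     # explicit young generation size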

The first thing that comes to mind is to reduce your heap size - GC pause
times grow with heap size. Start with 8G and increase based on tuning. This
should also give more memory over to the filesystem cache. Enable GC
logging on the server and run it through a tool like gcviewer to help
understand the heap behaviour, and something like YourKit to see what's
getting allocated (e.g. you might want to check the filter cache to see if
it's generating garbage due to a large working set, or if Lucene is
doing a lot of merging).

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

Run 'top -H' on the server; if you see a single java thread occupying
the CPU with other threads paused, that's possibly indicative of a full GC
happening. If your GC analysis indicates a lot of ParNew activity with
periodic full GCs due to promotion failures, you'll want to start looking
at the output of PrintTenuringDistribution to feed into tuning
CMSInitiatingOccupancyFraction/MaxTenuringThreshold/SurvivorRatio. At
the heap size you're running, unless you can avoid promotion failures you
are going to have significant pauses from time to time. The right settings
are likely going to be specific to ES and your workload.
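
As an illustration only (the actual numbers have to come out of your own GC
analysis), the knobs in question look like this:

  -XX:+PrintTenuringDistribution         # log object ages at each young GC
  -XX:SurvivorRatio=6                    # eden-to-survivor-space ratio (example value)
  -XX:MaxTenuringThreshold=6             # young GCs survived before promotion (example value)
  -XX:CMSInitiatingOccupancyFraction=70  # start CMS earlier to avoid promotion failures (example value)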

But I would simply start with a massively reduced heap, especially given
that your servers have SSDs as a fallback.

Bill

Adding one more question: do you use mmapfs and bootstrap.mlockall?

Jörg

Hi Simone,

We had issues with long GC pauses too, but we were able to resolve them by
adjusting our JVM command line flags. Unfortunately, when I was
experimenting with solutions I wasn’t very disciplined about adjusting only
one variable at a time, so I can’t say for certain exactly what change
resolved our issue. But I have a good guess.

I think the thing that worked for us was removing the flags
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly.
These flags are included in ./bin/elasticsearch.in.sh that the
Elasticsearch launch script uses by default, but I strongly suspect they
should be removed. These flags tell the JVM to begin a CMS collection when
and only when the old generation becomes more than 75% full. If your
application is allocating so rapidly that the old generation is filled
before the CMS collection has completed, then a "concurrent mode failure"
has occurred, and the JVM will fall back to a full stop-the-world
collection, which may have a long pause time.

I believe (but am not certain) that if you do not set these flags, then the
JVM will tune the CMSInitiatingOccupancyFraction on its own at runtime in
an effort to optimize the GC performance and avoid "concurrent mode
failure". I also think (but am not certain) that if you do set these flags,
then the JVM will assume you know better than it does, and will not tune
them further at runtime.

I also recommend you turn on GC logging. You can do this by including these
flags: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -Xloggc:<PATH TO LOG
FILE>. The log format is not intuitive, but with enough Google searches
you can begin to make sense of it. If you're using CMS and you see a Full GC
in the GC logs more than once every few days, then something is amiss.
My current GC log goes back over a month, and I have only one Full GC in it.
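
For example, assuming the stock startup scripts pass ES_JAVA_OPTS through to
the JVM, something like the following before starting the node (the log path
is just an example):

  export ES_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
    -Xloggc:/var/log/elasticsearch/gc.log"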

Good luck,

-Jon

Hi,

Thanks a lot for the great replies, a lot for me to work on.

Some replies:

  • Have you sized your ES cluster? How did you conclude that 8 nodes
    are enough? Have you configured the caches/filter memory?

I was not part of this decision; to be honest, I happened to touch
Elasticsearch for the first time when the result of a long GC
and some other collateral damage turned the cluster red. Since then I
have been trying to understand what we can do better. I have to check
the caches/filter settings.

  • update to Java 7

We are going to experiment with one node on OpenJDK 7.

  • update to ES 0.90+

More difficult to do, but we will at some point soon.

  • if you still get issues, try to reduce the "stop the world" duration
    by carefully adapting the heap size to your workload, or by lowering
    the memory pressure. You can scale ES over more nodes to achieve this

This is probably where I will put most of my effort: more, smaller
nodes; making the max shard size no larger than the heap size (hint from
another post, right now shard size is ~ 2.2 times the heap size); and
running separate dedicated query, index, data and master nodes (see the
sketch below).
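
A minimal sketch of the node role settings in elasticsearch.yml, assuming we
go the dedicated-roles route (these are the standard node.master/node.data
switches):

  # dedicated master-eligible node (holds no data)
  node.master: true
  node.data: false

  # data-only node (never elected master)
  node.master: false
  node.data: true

  # client/"query" node (no data, not master-eligible, just routes requests)
  node.master: false
  node.data: false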

  • Can you share gc log files?

Unfortunately I don't have any (I will enable it), but I do have data in
Graphite; perhaps if you can give me an idea of what you suspect you'd
find in the logs, I could look for it in the graphs.

  • How big is your -Xmn?

We don't pass this parameter, so we must be running the defaults. The logs show:

[gc][ParNew][1686997][5685847] duration [6s], collections [6]/[13.5s],
total [6s]/[4.9d], memory [29.2gb]->[15.1gb]/[29.9gb], all_pools
{[Code Cache] [13.4mb]->[13.4mb]/[48mb]}{[Par Eden Space]
[125.6mb]->[64mb]/[216.3mb]}{[Par Survivor Space]
[27mb]->[27mb]/[27mb]}{[CMS Old Gen] [29.1gb]->[15gb]/[29.7gb]}{[CMS
Perm Gen] [46.1mb]->[46.1mb]/[82mb]}

  • Enable gc logging on the server and run it through a tool like
    gcviewer to help understand the heap behaviour, and something like
    yourkit to see what's getting allocated (eg you might want to check
    the filter cache to see if it's generating garbage due to a large
    working set, or if lucene is doing a lot of merging).

Will do once we enable GC logging; thanks for the other suggestions.

  • do you use mmapfs and bootstrap.mlockall

We use bootstrap.mlockall (not mmapfs).

  • These flags tell the JVM to begin a CMS collection when and only
    when the old generation becomes more than 75% full. If your
    application is allocating so rapidly that the old generation is filled
    before the CMS collection has completed, then a "concurrent mode
    failure" has occurred, and the JVM will fall back to a full
    stop-the-world collection

I can see how this could be what's happening; we do indeed allocate
memory very rapidly, but I will go back and check the data before
jumping to conclusions. Thanks a lot.

Many thanks,

Simone

From your posted GC log entry, your node stalled for 6 seconds. This exceeds
the default ping timeout of 5s in 0.20. I think 6 seconds is actually very
fast for collapsing heap objects from 29.9g down to 15.1g.

You should decrease the heap size, and enable mmapfs. This may also take
some pressure off the GC from reorganizing a large heap.
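
For example, in elasticsearch.yml (standard setting names, values depend on
your setup):

  index.store.type: mmapfs    # memory-mapped Lucene files, served from the OS page cache
  bootstrap.mlockall: true    # lock the JVM heap in RAM so it cannot be swapped out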

As a workaround, you could increase the timeout settings
'transport.ping_timeout' in clients and 'discovery.zen.fd.ping_timeout' in
data nodes (fd = fault detection) to tolerate longer node outages caused
by GC stalls.
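
A sketch, with an arbitrary example value (the client side has a
corresponding ping timeout setting that would need raising as well):

  # elasticsearch.yml on the data nodes
  discovery.zen.fd.ping_timeout: 30s   # fault detection ping timeout (example value)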

Upgrading to the latest 0.90.x release will really help. Also, the timeout
defaults have changed.

Jörg

Hi,

The timeout is set to 10 * 30sec, totalling 5 minutes (way too long); I
think this was an attempt to avoid nodes dropping off the cluster, as
you say. We will add mmapfs to our tests, thanks.

I have enabled gc logging on one machine and will share it later today
after I have collected enough data.

Thanks a lot,

Simone

Hello,

Please find the GC logs at http://espastebin.s3.amazonaws.com/gc.log
(it is ~ 20mb). I am running it through gcviewer as suggested and trying
to make sense of it; any help appreciated.

Thanks,

Simone

Hi,

Some more info: we are running the default indices.cache.filter.size
at 20% of our heap, i.e. 6GB; this is also confirmed in bigdesk:

Cache size
Filter: 5.9gb
Field: 1.9gb

Cache evictions
Filter: 68597765
Field: 0
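
For reference, the setting in question in elasticsearch.yml; the value below
is only an example of capping it lower than the 20% we currently run with (a
fixed byte size should also be accepted):

  indices.cache.filter.size: 10%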

I have been looking at the gcviewer graphs and reading the GC logs. Our
heap grows very quickly (it can grow by 5GB and have 5GB collected three
times in a minute), and looking at the logs it seems to me we are not
giving objects enough time to die young: objects rarely make it into
age 2, and the 'desired survivor size' is constantly exceeded, leading
to the 'new threshold' being almost constantly 1. The overall
throughput is shown as ~ 62%, with > 850 full GCs in 26 hours.

I have been experimenting with the new generation and survivor ratio:
I added -Xmn512m to the args (double what it was before) and restarted
the same node; it took some time before it got shards back and some
real load. I can now see that it is getting ~100 search requests per
second with peaks of ~300, and an average of 50 indexing requests per
second. Compared to another node in the cluster that shows similar
numbers, I can observe more regular GC graphs: the old gen still grows
quickly, but not as much as before, and it generally stays well below
the max heap size; YGCs are quick, and we only had a couple of full GCs
that were relatively OK (5/8 sec). Still not the kind of graphs I would
like to see, but it seems to be moving in the right direction.

On another node I have also modified the survivor ratio to 1:3 of Xmn to
get a larger survivor space. The problem is that after restarting ES it
now only holds 2 primary shards out of 11 total (we use _primary_first
and haven't been able to change this yet) and it is not getting any
queries at all; I am not sure why yet...

The other issue I am seeing is that the query time is very variable:
the same simple request (REST) looped with a 1 sec delay can return in
<200ms or >5/10sec. I am trying to find out where the query is spending
its time but I haven't found an easy way to get this data, any
suggestions? Perhaps setting very low query and fetch trace thresholds
for the slow log through the API? (see the sketch below)
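
Something like this is what I had in mind, assuming the search slow log
thresholds are dynamically updatable on our version; the index name, host
and threshold values are placeholders:

  curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
    "index.search.slowlog.threshold.query.trace": "500ms",
    "index.search.slowlog.threshold.fetch.trace": "500ms"
  }'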

Many thanks,

Simone
