Promotion failures (GC issues)


(nicolas.long) #1

Hey all,

We regularly (several times a week) get longish GCs (20s or more) due to
promotion failures.

From what I understand this type of major GC is caused by fragmentation of
the heap.

So I'm wondering:

  1. What is all the stuff ES puts into the heap that ends up in the Old Gen?
  2. Are there any recommended strategies for dealing with this specific kind
    of problem?

For example, would allowing more filter caching help or cause even more
problems? And so on.

To give a little more info on our usage: we're read-heavy, and our queries are
almost entirely filter operations. Our heap is ~10 GB, nearly all of which is
used by the Old Gen (until a major GC runs).
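
To check how much of the heap the old generation actually holds, the JVM's standard java.lang.management MXBeans can be queried. This is only a rough sketch: the class name HeapPoolReport and the name-matching heuristic in looksLikeOldGen are assumptions made for illustration, and the code reports on the JVM it runs in, so for an Elasticsearch node the same beans would have to be read from that node's process (for example over JMX).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of Elasticsearch.
class HeapPoolReport {

    /** Used bytes per heap memory pool of the JVM this code runs in. */
    static Map<String, Long> heapPoolUsedBytes() {
        Map<String, Long> used = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) {
                used.put(pool.getName(), usage.getUsed());
            }
        }
        return used;
    }

    /**
     * Assumption: old-generation pools are recognised by name only
     * (e.g. "CMS Old Gen", "PS Old Gen", "G1 Old Gen", "Tenured Gen").
     * Other collectors may use different pool names.
     */
    static boolean looksLikeOldGen(String poolName) {
        return poolName.contains("Old Gen") || poolName.contains("Tenured");
    }

    public static void main(String[] args) {
        heapPoolUsedBytes().forEach((name, bytes) ->
            System.out.printf("%-25s %8d MB%s%n", name, bytes / (1024 * 1024),
                looksLikeOldGen(name) ? "  <- old generation" : ""));
    }
}
```

A single reading says little on its own; sampling it over time shows whether the old generation only empties after a major GC.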

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/4b3ef926-94b7-4de0-b076-d5fdbc44021c%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

Maybe it's the field data cache that gets promoted to the old gen when using
facets.

I am tackling this challenge with a combination of several strategies:

  • tuning indices.fielddata.cache.size

  • working around the issue by increasing node transport and ping timeout
    from 5s to something high like 30s (so GCs are allowed to run 20s without
    node disconnects)

  • reducing the number of shards per node (that is, somehow reducing the
    number of docs / index size / filter cache per node); the simplest method
    is adding nodes

  • using heap sizes as small as possible - in my use case 6 GB is sufficient

  • not sure if you want to go down the bleeding-edge path, but using Java
    8 and G1GC with -XX:MaxGCPauseMillis of ~100-1000ms helps me. CPU load is a
    bit higher with G1GC, but since I have 32 cores on a node, it does not
    matter that much.

  • otherwise, there are lots of CMS GC tuning options (needs deep GC
    analysis)
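
As a first step before any deeper GC analysis, the collectors' cumulative counts and times can be read from the standard GarbageCollectorMXBean interface. The sketch below is an illustration only: GcPauseCheck and averageMillis are hypothetical names, and the code inspects the JVM it runs in rather than a remote Elasticsearch node.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class, not part of Elasticsearch.
class GcPauseCheck {

    /** Average milliseconds per collection; 0 if there were none or the value is undefined. */
    static double averageMillis(long count, long totalMillis) {
        if (count <= 0 || totalMillis < 0) {
            return 0.0; // the MXBean reports -1 when a value is undefined
        }
        return (double) totalMillis / count;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // collections so far, -1 if undefined
            long time = gc.getCollectionTime();   // approximate accumulated time in ms, -1 if undefined
            System.out.printf("%-25s collections=%d totalMs=%d avgMs=%.1f%n",
                gc.getName(), count, time, averageMillis(count, time));
        }
    }
}
```

An average hides individual long pauses such as a 20s promotion failure, so these numbers are best used to spot trends; the GC log is still needed to see each pause.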

Jörg



(nicolas.long) #3

Hey Jörg,

Thanks for the detailed reply.

We don't really run facets, and our field data cache size is very small.
Increasing the node transport and ping timeouts is definitely something
we'll consider. Reducing the number of shards per node is also worth
considering, but I'm reluctant to add more nodes at the moment (we're already
spending lots of cash).

I think a deep dive into GC tuning is possibly called for, and we've done
some of that already.

Java 8 with G1GC is an interesting suggestion too!

Thanks again,

Nic


