Excessive garbage collection

I have a very strange situation around GC:

X-Pack reports around 50% heap use, and based on its graphs a full GC seems to run only about once an hour. Still, I am experiencing performance problems and seeing the following:

    2018-04-10T19:45:38.994+0200: 121364.173: [GC (Allocation Failure) 2018-04-10T19:45:38.994+0200: 121364.173: [ParNew
    Desired survivor size 8716288 bytes, new threshold 1 (max 6)
    - age   1: 12705144 bytes, 12705144 total
    - age   2:   491144 bytes, 13196288 total
    - age   3:   524104 bytes, 13720392 total
    - age   4:  2365904 bytes, 16086296 total
    - age   5:   919392 bytes, 17005688 total
    : 144916K->17024K(153344K), 0.0804029 secs] 9977937K->9856140K(18857344K), 0.0805133 secs] [Times: user=0.15 sys=0.00, real=0.08 secs]
    2018-04-10T19:45:39.075+0200: 121364.254: Total time for which application threads were stopped: 0.0821092 seconds, Stopping threads took: 0.0012391 seconds
    2018-04-10T19:45:42.770+0200: 121367.949: [GC (Allocation Failure) 2018-04-10T19:45:42.770+0200: 121367.949: [ParNew
    Desired survivor size 8716288 bytes, new threshold 6 (max 6)
    - age   1:  6423872 bytes,  6423872 total
    : 153344K->13199K(153344K), 0.1366576 secs] 9992460K->9864645K(18857344K), 0.1367784 secs] [Times: user=0.13 sys=0.00, real=0.14 secs]
    2018-04-10T19:45:42.907+0200: 121368.086: Total time for which application threads were stopped: 0.1494425 seconds, Stopping threads took: 0.0123295 seconds
    2018-04-10T19:45:42.974+0200: 121368.153: [GC (Allocation Failure) 2018-04-10T19:45:42.974+0200: 121368.153: [ParNew
    Desired survivor size 8716288 bytes, new threshold 2 (max 6)
    - age   1:  6224448 bytes,  6224448 total
    - age   2:  5998824 bytes, 12223272 total
    : 149519K->16954K(153344K), 0.0651545 secs] 10000965K->9868400K(18857344K), 0.0652718 secs] [Times: user=0.12 sys=0.00, real=0.06 secs]
    2018-04-10T19:45:43.039+0200: 121368.218: Total time for which application threads were stopped: 0.0664671 seconds, Stopping threads took: 0.0008314 seconds
    2018-04-10T19:45:43.096+0200: 121368.275: [GC (Allocation Failure) 2018-04-10T19:45:43.096+0200: 121368.275: [ParNew
    Desired survivor size 8716288 bytes, new threshold 6 (max 6)
    - age   1:   868512 bytes,   868512 total
    - age   2:  5690600 bytes,  6559112 total
    : 153274K->10365K(153344K), 0.0873122 secs] 10004720K->9867035K(18857344K), 0.0874245 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
    2018-04-10T19:45:43.184+0200: 121368.362: Total time for which application threads were stopped: 0.0886560 seconds, Stopping threads took: 0.0008844 seconds
    2018-04-10T19:45:43.420+0200: 121368.599: [GC (Allocation Failure) 2018-04-10T19:45:43.420+0200: 121368.599: [ParNew
    Desired survivor size 8716288 bytes, new threshold 1 (max 6)
    - age   1: 11281104 bytes, 11281104 total
    - age   2:   558224 bytes, 11839328 total
    - age   3:  5535280 bytes, 17374608 total
    : 146242K->17024K(153344K), 0.0668974 secs] 10002912K->9875064K(18857344K), 0.0670153 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]

That is, the GC log shows collections running very frequently without reclaiming much, even though the JVM is given 18 GB:
    SLsl 719:18 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.yloV52c4 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

Because of this, searches in Kibana that span a significant amount of data fail with timeouts, and the dreaded

    [HUCBRNR] [gc][121100] overhead, spent [266ms] collecting in the last [1s]

appears in the log.

What could be the cause of this?

The instance is indexing between 300 and 700 events per second.

The new generation is surprisingly small, and that is what is driving all the ParNew activity. Can you remove the -Xms1g -Xmx1g options that appear at the start of the command line? I am wondering whether they influence the choice of new-generation size even though they are overridden later.

Unfortunately not; it's hardcoded by Elastic, see: https://github.com/elastic/elasticsearch-docker/issues/151

That's OK, that is not the issue. Something is deliberately setting MaxNewSize low. With a 6 GB heap I see this in my gc.log: ... -XX:MaxNewSize=174485504. That seems crazy low to me, but it wasn't put there by the JVM option fairy, so there must be a reason.

If I add -XX:MaxNewSize=2147483648 to my jvm.options then it is respected. I still get terrible performance from ParNew, just a lot less frequently.
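
For future readers, a minimal sketch of what such an override looks like in config/jvm.options, assuming the 18 GB heap from this thread (the 2 GB value is just the one quoted above, not a recommendation):

    # config/jvm.options
    -Xms18g
    -Xmx18g
    # cap the new generation at 2 GB instead of the ~166 MB chosen ergonomically
    -XX:MaxNewSize=2147483648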

Personally I run Logstash using the G1 garbage collector, but I believe that is unsupported by Elastic.

This is Elasticsearch, not Logstash. I have looked more into the figures, and it seems young-generation collections run very often even though there is a lot of free heap, while the old generation is collected only about twice an hour.
I tried removing the -Xmx1g and -Xms1g that are there by default in the settings, but it did not make any difference. I don't think the defaults are sane.

The args now are:
    /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.zq8Jfdle -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

In logs/gc.log the third line should show the command-line flags. If it does not, you can enable this using -XX:+PrintCommandLineFlags. What is the value of -XX:MaxNewSize=?
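
For example, something like this should pick it out of the rotated logs (paths assume the -Xloggc setting shown above):

    # the CommandLine flags line is printed near the top of each GC log
    grep -m1 'CommandLine flags' logs/gc.log*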

    fgrep MaxNewSize *
    gc.log.0.current:CommandLine flags: -XX:+AlwaysPreTouch -XX:CMSInitiatingOccupancyFraction=75 -XX:GCLogFileSize=67108864 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=19327352832 -XX:MaxHeapSize=19327352832 -XX:MaxNewSize=174485504 -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:NumberOfGCLogFiles=32 -XX:OldPLABSize=16 -XX:OldSize=348971008 -XX:-OmitStackTraceInFastThrow -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:ThreadStackSize=1024 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC

To me, telling the JVM to use 166 MB of an 18 GB heap for the new generation is crazy. I wish I could find where that gets set, or better yet, why someone thought it was a good idea to set it. They must have had a reason.

What do things look like if you set -XX:MaxNewSize=2147483648 in your jvm.options?
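
As a side note, if I remember the nodes info API correctly, something like this should confirm after a restart whether the flag was actually picked up (adjust host and port to your setup):

    curl -s 'localhost:9200/_nodes/_all?filter_path=nodes.*.jvm.input_arguments&pretty'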

Just a minor clarification: as mentioned in https://github.com/elastic/elasticsearch-docker/issues/151#issuecomment-368113213, overriding the heap settings in the Elasticsearch Docker image works, and you should be able to confirm this by:

  1. checking the heap size the Elasticsearch log reports at startup. For example, below I started with "ES_JAVA_OPTS=-Xms3G -Xmx3G":
    [2018-04-12T07:39:06,417][INFO ][o.e.e.NodeEnvironment ] [4b5QpNp] heap size [2.9gb], compressed ordinary object pointers [true]

  2. or probing the API, e.g.:

    $ curl -s localhost:39200/_nodes/_all | jq .nodes[].jvm.mem
    {
      "heap_init_in_bytes": 3221225472,
      "heap_max_in_bytes": 3133997056,
      "non_heap_init_in_bytes": 2555904,
      "non_heap_max_in_bytes": 0,
      "direct_max_in_bytes": 3133997056
    }
    

    or using filter_path:

    curl -s 'localhost:39200/_nodes/_all?filter_path=nodes.*.jvm.mem.heap*&pretty'
    {
       "nodes" : {
        "yLFcU47RSHOkg-pLZsmZnA" : {
          "jvm" : {
            "mem" : {
              "heap_init_in_bytes" : 3221225472,
              "heap_max_in_bytes" : 3133997056
            }
          }
        }
      }
    }
    
  3. or in the ps output you've included:

    /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.yloV52c4 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

    where you can see the parameters -Xms18g -Xmx18g defined before -Des.path.home.

Yes, the actual heap size can be set with those parameters, but what is not so good is the sizing of the young generation; see the comments from Badger above about "MaxNewSize".

Just for reference for future readers who might find this thread: the JVM determines the size of the young generation ergonomically (i.e. based on the capabilities of the hardware). One of the parameters that influences this decision is the number of available GC threads, which in turn is based on the number of available CPUs on that machine. On the machine in question the JVM had only two GC threads, which led to such a small young generation size. The full analysis with more details is available on GitHub.
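
As a minimal sketch for checking this yourself (standard HotSpot flags, nothing Elasticsearch-specific), you can print the flag values the JVM settles on for a given machine without starting a node:

    # show the ergonomically chosen new-generation size and GC thread count
    java -Xmx18g -XX:+UseConcMarkSweepGC -XX:+PrintFlagsFinal -version | grep -E 'MaxNewSize|ParallelGCThreads'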
