Excessive garbage collection

davidkarlsen · April 10, 2018, 5:53pm

I have a very strange situation around GC:

x-pack will report around 50% heap use, and based on the graphs in x-pack it seems to do a full gc around every hour, still I am experiencing performance problems and seeing:

2018-04-10T19:45:38.994+0200: 121364.173: [GC (Allocation Failure) 2018-04-10T19:45:38.994+0200: 121364.173: [ParNew
Desired survivor size 8716288 bytes, new threshold 1 (max 6)

age 1: 12705144 bytes, 12705144 total
age 2: 491144 bytes, 13196288 total
age 3: 524104 bytes, 13720392 total
age 4: 2365904 bytes, 16086296 total
age 5: 919392 bytes, 17005688 total
: 144916K->17024K(153344K), 0.0804029 secs] 9977937K->9856140K(18857344K), 0.0805133 secs] [Times: user=0.15 sys=0.00, real=0.08 secs]
2018-04-10T19:45:39.075+0200: 121364.254: Total time for which application threads were stopped: 0.0821092 seconds, Stopping threads took: 0.0012391 seconds
2018-04-10T19:45:42.770+0200: 121367.949: [GC (Allocation Failure) 2018-04-10T19:45:42.770+0200: 121367.949: [ParNew
Desired survivor size 8716288 bytes, new threshold 6 (max 6)
age 1: 6423872 bytes, 6423872 total
: 153344K->13199K(153344K), 0.1366576 secs] 9992460K->9864645K(18857344K), 0.1367784 secs] [Times: user=0.13 sys=0.00, real=0.14 secs]
2018-04-10T19:45:42.907+0200: 121368.086: Total time for which application threads were stopped: 0.1494425 seconds, Stopping threads took: 0.0123295 seconds
2018-04-10T19:45:42.974+0200: 121368.153: [GC (Allocation Failure) 2018-04-10T19:45:42.974+0200: 121368.153: [ParNew
Desired survivor size 8716288 bytes, new threshold 2 (max 6)
age 1: 6224448 bytes, 6224448 total
age 2: 5998824 bytes, 12223272 total
: 149519K->16954K(153344K), 0.0651545 secs] 10000965K->9868400K(18857344K), 0.0652718 secs] [Times: user=0.12 sys=0.00, real=0.06 secs]
2018-04-10T19:45:43.039+0200: 121368.218: Total time for which application threads were stopped: 0.0664671 seconds, Stopping threads took: 0.0008314 seconds
2018-04-10T19:45:43.096+0200: 121368.275: [GC (Allocation Failure) 2018-04-10T19:45:43.096+0200: 121368.275: [ParNew
Desired survivor size 8716288 bytes, new threshold 6 (max 6)
age 1: 868512 bytes, 868512 total
age 2: 5690600 bytes, 6559112 total
: 153274K->10365K(153344K), 0.0873122 secs] 10004720K->9867035K(18857344K), 0.0874245 secs] [Times: user=0.16 sys=0.00, real=0.08 secs]
2018-04-10T19:45:43.184+0200: 121368.362: Total time for which application threads were stopped: 0.0886560 seconds, Stopping threads took: 0.0008844 seconds
2018-04-10T19:45:43.420+0200: 121368.599: [GC (Allocation Failure) 2018-04-10T19:45:43.420+0200: 121368.599: [ParNew
Desired survivor size 8716288 bytes, new threshold 1 (max 6)
age 1: 11281104 bytes, 11281104 total
age 2: 558224 bytes, 11839328 total
age 3: 5535280 bytes, 17374608 total
: 146242K->17024K(153344K), 0.0668974 secs] 10002912K->9875064K(18857344K), 0.0670153 secs] [Times: user=0.12 sys=0.00, real=0.07 secs]
in the gc log

e.g. it seems to do gc very frequently (and not removing much)

even though the VM is given 18G:
SLsl 719:18 /usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.yloV52c4 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

due to this searches in kibana spanning quite a bit of data will fail on timeouts, and the dreaded

[HUCBRNR] [gc][121100] overhead, spent [266ms] collecting in the last [1s]

appears in the log.

what could be the cause of this?

The instance is indexing between 300-700 events a second

Badger · April 10, 2018, 6:17pm

The new generation is surprisingly small, and that is what is driving all the ParNew activity. Can you remove these? I am wondering if that is influencing the choice of NewGen size even though they are overriden later.

davidkarlsen · April 10, 2018, 6:53pm

Unfortunately not, it's hardcoded by elastic, see: https://github.com/elastic/elasticsearch-docker/issues/151

Badger · April 10, 2018, 7:55pm

That's OK, that is not the issue. Something is deliberately setting MaxNewSize low. With a 6 GB heap then I see this in my gc.log ... -XX:MaxNewSize=174485504. That seems crazy low to me but it wasn't put there by the JVM option fairy, so there must be a reason.

If I add -XX:MaxNewSize=2147483648 to my jvm.options then it is respected. I still get terrible performance from ParNew, just a lot less frequently.

Personally I run logstash using the G1 garbage collector, but I believe that is unsupported by Elastic.

davidkarlsen · April 11, 2018, 7:05pm

This is elastic, not logstash. I have looked more into the figures, and it seems the young generation is running very often, but there is a lot of free heap, and the old generation runs only about twice an hour.
I tried removing the -Xmx1g and -Xms1g that's there by default in the settings, but it did not influence it in any way. I think the defaults set are not sane.

The args now are:
/usr/lib/jvm/jre-1.8.0-openjdk/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.zq8Jfdle -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

Badger · April 11, 2018, 7:16pm

In logs/gc.log the third line should have the command line flags. If it does not you can enable it using -XX:+PrintCommandLineFlags. What is the value of -XX:MaxNewSize= ?

davidkarlsen · April 11, 2018, 7:19pm

fgrep MaxNewSize *
gc.log.0.current:CommandLine flags: -XX:+AlwaysPreTouch -XX:CMSInitiatingOccupancyFraction=75 -XX:GCLogFileSize=67108864 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=19327352832 -XX:MaxHeapSize=19327352832 -XX:MaxNewSize=174485504 -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:NumberOfGCLogFiles=32 -XX:OldPLABSize=16 -XX:OldSize=348971008 -XX:-OmitStackTraceInFastThrow -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:ThreadStackSize=1024 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC

Badger · April 11, 2018, 7:25pm

To me telling the JVM to use 166 MB out of an 18 GB heap for the new generation is crazy. I wish I could find where that gets set, or even better, why someone thought it a good idea to set it. They must have had a reason.

What do things look like if you set -XX:MaxNewSize=2147483648 in your jvm.options?

dliappis · April 12, 2018, 7:59am

Just a minor clarification as mentioned in https://github.com/elastic/elasticsearch-docker/issues/151#issuecomment-368113213, overriding the heap settings in the Elasticsearch Docker image works and you should be able to confirm this either by:

checking the Elasticsearch log about the heap size when it starts. For example, below I've started with "ES_JAVA_OPTS=-Xms3G -Xmx3G":
[2018-04-12T07:39:06,417][INFO ][o.e.e.NodeEnvironment ] [4b5QpNp] heap size [2.9gb], compressed ordinary object pointers [true]

or probing the API, e.g.:

$ curl -s localhost:39200/_nodes/_all | jq .nodes[].jvm.mem
{
"heap_init_in_bytes": 3221225472,
"heap_max_in_bytes": 3133997056,
"non_heap_init_in_bytes": 2555904,
"non_heap_max_in_bytes": 0,
"direct_max_in_bytes": 3133997056
}

or using filter_path:

curl -s 'localhost:39200/_nodes/_all?filter_path=nodes.*.jvm.mem.heap*&pretty'
{
   "nodes" : {
    "yLFcU47RSHOkg-pLZsmZnA" : {
      "jvm" : {
        "mem" : {
          "heap_init_in_bytes" : 3221225472,
          "heap_max_in_bytes" : 3133997056
        }
      }
    }
  }
}

or in the ps output you've included:

/usr/lib/jvm/jre-1.8.0-openjdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch.yloV52c4 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -Des.cgroups.hierarchy.override=/ -Xms18g -Xmx18g -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

where you can see the parameters -Xms18g -Xmx18g defined before -Des.path.home.

davidkarlsen · April 12, 2018, 8:05am

Yes the actual heapsize can be set with the parameters, but what is not so good are the sizings of the young gen, see comments from Badger above on "MaxNewSize"

danielmitterdorfer · April 16, 2018, 6:36am

Just for reference for future readers that might find this thread: The JVM determines the size of the young generation ergonomically (i.e. based on the capabilities of the hardware). One of the parameters that influence this decision is the number of available GC threads which is in turn based on the number of available CPUs on that machine. On said machine the JVM had only two GC threads which lead to such a small young generation size. The full analysis with more details is available on Github.

system · May 14, 2018, 6:36am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch fills the heap then spends all its time doing garbage collection Elasticsearch	5	5304	July 6, 2017
Heap usage holds steady at max and GC does not run. Need to force restart the cluster Elasticsearch elastic-stack-monitoring	5	548	July 6, 2023
Heap / GC Issues Elasticsearch	9	480	July 6, 2017
GC out of control on 12 node elastic 5.4 cluster Elasticsearch	1	479	August 7, 2017
Large heap usage with each node Elasticsearch	15	3692	July 5, 2017

Excessive garbage collection

Related topics