Post-upgrade issues: 6.4.2 ES heap running away (6.5.0 too)

I recently upgraded my cluster from 6.2.3 to 6.4.2 (all components: Elasticsearch, Logstash, Kibana).

Ever since then, we've been having issues with the heap space running out.

I have trimmed down indices everywhere (closed, deleted, etc.).

Facts:

Linux:

Linux 4.14.67-66.56.amzn1.x86_64 #1 SMP Tue Sep 4 22:03:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Java:

$ alternatives --display java
java - status is auto.
link currently points to /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java

2 nodes (bad, I know, but not the issue at hand) with 8 CPUs and 16GB RAM each, with the JVM heap set to 9GB on both. I have tried 8GB as well; it doesn't make much difference.

Total of 586 primary shards and 586 replica shards.

After a service restart everything seems OK, but after a while the GC loop seems to "slow down", and the sawtooth pattern gets smaller and smaller. See example here:

During this time the old-generation usage also spikes.

Before the upgrade I had close to triple the data (about 4000 total shards) across the same hardware, and while it was running hot, say around 85% heap, it wasn't crashing.

Post-upgrade, it seems to run for a while and then gets hosed.

OS swap is turned off entirely.

Is there some sort of oplog that fills up somehow and then causes the runaway?

The really odd thing is that when the heap runs away as you see in the graph, Elasticsearch would normally (before) crash with a Java OutOfMemoryError. In this case it doesn't crash; it just quickly stops functioning. The environment is extremely sluggish at that point, but not dead.


To illustrate the problem further:

One more observation: this runaway issue seems to happen on the master node first. In my case, though, it could also be because that's where all the primary shards live, since I am currently on a two-node config (each reboot flushes all the shards to one side).

Were there any changes to how ES handles memory and/or master operations as they relate to GC and heap utilization?

As I am watching the sawtooth get smaller and smaller, my old GC count is starting to go up.
[screenshot: heap sawtooth shrinking while the old GC count climbs]

Another thing I am noticing is that as CPU usage goes up, one thread consistently stays very high on CPU utilization:


And as mentioned before, we're now in a GC loop.

[2018-10-04T21:33:12,050][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][old][16067][490] duration [26.4s], collections [2]/[27.3s], total [26.4s]/[2.3m], memory [8.8gb]->[8.1gb]/[8.9gb], all_pools {[young] [473.5mb]->[64.9mb]/[532.5mb]}{[survivor] [51.3mb]->[0b]/[66.5mb]}{[old] [8.3gb]->[8gb]/[8.3gb]}
[2018-10-04T21:33:12,050][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][16067] overhead, spent [26.4s] collecting in the last [27.3s]
[2018-10-04T21:33:44,722][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][old][16068][491] duration [32s], collections [1]/[32.6s], total [32s]/[2.8m], memory [8.1gb]->[8.2gb]/[8.9gb], all_pools {[young] [64.9mb]->[121mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [8gb]->[8gb]/[8.3gb]}
[2018-10-04T21:33:44,722][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][16068] overhead, spent [32.1s] collecting in the last [32.6s]
[2018-10-04T21:34:21,134][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][old][16078][493] duration [26.5s], collections [2]/[27.4s], total [26.5s]/[3.3m], memory [8.5gb]->[8gb]/[8.9gb], all_pools {[young] [120.8mb]->[33.3mb]/[532.5mb]}{[survivor] [49.9mb]->[0b]/[66.5mb]}{[old] [8.3gb]->[8gb]/[8.3gb]}
[2018-10-04T21:34:21,134][WARN ][o.e.m.j.JvmGcMonitorService] [172.32.8.40] [gc][16078] overhead, spent [26.5s] collecting in the last [27.4s]

Anyone have any ideas? I've reduced the index content so low now that this should not be happening. I even rebooted the nodes yesterday, but I am a bit at my wits' end here.

Is there a memory leak somewhere?

I think I'd start by trying to take a heap dump when the process is near its death throes so we can see what's filling it all up. 500 shards per node on an 8GB heap still seems like quite a lot, depending on how you're using it.
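For what it's worth, a minimal sketch of capturing a dump from a live node with stock JDK tooling (run as the user that owns the Elasticsearch process; the output path is just an example and needs plenty of free space):

# find the Elasticsearch PID
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
# dump only live objects (this triggers a full GC first) into a binary hprof file
jmap -dump:live,format=b,file=/var/lib/elasticsearch/es-heap.hprof $ES_PID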

Most of the usage is daily log indices from *beats. I've reconfigured my templates to create 1 shard per index instead of 2 and keep the replica; given all the Curator jobs I've set up, the cluster should drop half of its shards over the next 14 days.
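For reference, the template change is roughly this (the template name and index pattern here are placeholders, not my exact ones):

PUT _template/beats-logs
{
  "index_patterns": ["*beat-*"],
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}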

What I am trying right now is G1GC instead of CMS. It's about 4 hours in and the cluster is stable.

While that's being evaluated (as I understand it, G1GC is not officially supported yet, though aside from statements that it's unsupported, most of the user feedback seems fine for heaps >= 8GB, as long as Java is up to date and >= 8), I am looking to lower CMSInitiatingOccupancyFraction from the default of 75 to about 50, perhaps forcing the GC to be more aggressive.
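In jvm.options terms, the two experiments look roughly like this (only one collector should be active at a time; the G1 line replaces the three CMS lines):

# experiment 1: more aggressive CMS
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=50
-XX:+UseCMSInitiatingOccupancyOnly

# experiment 2: G1GC instead of CMS (comment out the CMS lines above before enabling)
#-XX:+UseG1GC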

Over the next few days I will probably swap back to CMS, take a heap dump near the peak as you recommended, and post it here.

I saw somewhere that the guideline was about 1000 shards per 30GB of heap, so for 8GB you'd want roughly 200-250 shards. It depends on other factors, like the number of fields, but it's something like that.

Yeah, I've reduced the total number of shards to about 540, but the issue persists.

I upgraded the boxes to Java 10; the issue persists.

I've tried lowering CMSInitiatingOccupancyFraction further; the issue persists.

I disabled X-Pack monitoring, thinking the cluster-state invocation every 10s was taxing the heap too much on the master node; the issue persists.

So, we pretty much wiped the cluster clean of data.

The heap is still running away:
[screenshot: heap usage still climbing after wiping the data]
This is always happening on the master node only.

[13:19:30][root]$ uname -a
Linux host-2 4.14.67-66.56.amzn1.x86_64 #1 SMP Tue Sep 4 22:03:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

[16:48:32][root]$ java -version
java version "10.0.2" 2018-07-17
Java(TM) SE Runtime Environment 18.3 (build 10.0.2+13)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10.0.2+13, mixed mode)

Both machines have 16GB of RAM; they are compute-optimized EC2 instances (c4.2xlarge).

And here's the JVM config:

[16:49:40][root]$ cat /etc/elasticsearch/jvm.options
-Xms8g
-Xmx8g

# GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
9-:-Djava.locale.providers=COMPAT

10-:-XX:UseAVX=2

And here's the ES config:

# ---------------------------------- Cluster -----------------------------------
#
cluster.name: elk-elasticsearch
#
# ------------------------------------ Node ------------------------------------
#
node.name: 172.32.9.177
node.max_local_storage_nodes: 1
#
# ----------------------------------- Paths ------------------------------------
#
#path.conf: /etc/elasticsearch
path.data: /data/elk-elasticsearch
path.logs: /logs
#
# ---------------------------------- Network -----------------------------------
#
network.host: ["172.32.9.177", localhost]
http.port: 9200
#
# --------------------------------- Discovery ----------------------------------
#
discovery.zen.ping.unicast.hosts: ["172.32.9.177","172.32.8.40"]
discovery.zen.minimum_master_nodes: 1
#
# ---------------------------------- Various -----------------------------------
#
action.destructive_requires_name: true
#
# -------------------------- Custom Chef Configuration --------------------------
gateway.expected_nodes: 0
transport.tcp.port: 9300

xpack.monitoring.enabled: false

While I understand that "more heap is better", a cluster with only 124 total shards (68 primaries), 28GB of net data, and ~38M documents should not be killing a 16GB combined-heap setup.
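For the record, those numbers come from the _cat APIs, along these lines:

GET _cat/health?v
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size
GET _cat/nodes?v&h=name,heap.percent,heap.max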

And I cannot stress this point enough: 6.2.3 did not have this issue. Only after the upgrade to 6.4.2 did this start. On 6.2.3 I was running close to 800GB of data and around 4000 shards day to day, and while performance wasn't super zippy, it was stable.

I have tried various GC combinations, more RAM assigned to the heap (8/9/10GB), less RAM, and a lower CMSInitiatingOccupancyFraction (50, 60, 70). Everything leads to the same problem, which can be illustrated like this:

[screenshot: heap climbing until GC can no longer reclaim memory]

The master node seems to leak memory somewhere in the heap until it can't GC anymore, at which point old GC revs up like a lion. Within about 2 hours, GC time goes from 400ms out of 1s to about 28s out of 30s intervals, and then the cluster eventually runs out of heap, because the old-generation usage never goes down.

I've added a 3rd node with an 8GB heap (node specs are the same: 16GB RAM, 8 CPUs).

We'll see if this helps now that the cluster can rebalance itself.

I've also upgraded to Java 11, since Java 10 is not an LTS release and is no longer offered by Oracle.

Sigh, the issue continues to persist.

Even with 3 nodes, 24GB of combined heap (8GB per node), and only 250 shards total, the heap on the master runs away.

I am at a complete loss at this point, and will most likely look to roll back, or keep cron scripts around to bounce the nodes every 3 hours.
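For the record, the stopgap I have in mind is nothing fancier than a cron entry along these lines (the restart command depends on the init system; Amazon Linux 1 uses the service wrapper):

# /etc/cron.d/es-bounce -- temporary workaround: restart ES every 3 hours
0 */3 * * * root /sbin/service elasticsearch restart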

I had something similar until I started setting my indices to read-only and running a force merge once they were a few days old. It seems that reducing the number of segments was my fix. That was on 6.x; I'm not yet on 6.4.
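Roughly, per aged index (the index name here is just an example):

# block writes once the index is a few days old
PUT filebeat-2018.10.01/_settings
{
  "index.blocks.write": true
}

# then collapse it down to a single segment
POST filebeat-2018.10.01/_forcemerge?max_num_segments=1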

Yeah, I was doing that on 6.2.x and it was enough. My cluster was loaded, sure, but the heap was not running away in any fashion; it would go up daily, and when the Curator jobs ran every night to snapshot + delete + forcemerge, it would go back down to a manageable level.

Something changed between 6.2.3 and 6.4.2, because, as I've shown, I don't see any reason why a 3-node cluster with this much RAM and CPU can't handle the provided load.

Even after upgrading Java and trying more GC combinations, it seems that none of it makes a difference.

This is starting to make me think it could be related to the way the new ES version works with Amazon Linux.

Unfortunately, I do not have the bandwidth to rebuild a whole new cluster on a different OS at this time.

This would be a useful next step.

I have some heap dumps, but as they are quite large, I'm not sure how I can upload them.

Otherwise, my experience with deep-diving into JVM heap analysis is fairly surface-level; I'm still figuring it out.

Do you have a whitepaper for how to crack one open, and what to look for?

I think I'd try MAT to open it, and I'd basically start by looking at what's using the highest proportion of memory and hope that this gives us a clue about where all the heap is going.

Update: I haven't yet had time to crack this open. After initially spending a lot of time on it, I have had to work on other things. I'll keep this issue updated when I eventually get around to the heap analysis.

Meanwhile, @DavidTurner, since I believe this may be related to the AWS kernel (the only thing I can think of), do you have a list of officially supported kernels, along with Java pairings?

We have a support matrix which shows support for Amazon Linux as well as a number of other Unixes. This means we test on these platforms, using reasonably recent kernel versions. I checked the build logs and currently the Amazon Linux tests are using 4.14.72-73.55.amzn2.x86_64.

What makes you think this is a kernel bug?

Can you push a heap dump to S3 and share a link so we can take a look?
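For example, something like this (the bucket name is a placeholder; compressing first helps, since hprof files shrink a lot):

$ gzip es-heap.hprof
$ aws s3 cp es-heap.hprof.gz s3://your-bucket/es-heap.hprof.gz
$ aws s3 presign s3://your-bucket/es-heap.hprof.gz --expires-in 604800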

(also we're not all guys :wink: )

I have the same problem with the heap. Today I finished a rolling upgrade from 5.6.12 to 6.4.3 on all nodes, and now the heap on the elected master node is filling up very quickly. All 3 master nodes are dedicated, with 4GB of heap each.

Heap graph:

From log:
[2018-11-08T19:07:53,096][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2432] overhead, spent [262ms] collecting in the last [1s]
[2018-11-08T19:09:12,486][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2511] overhead, spent [264ms] collecting in the last [1s]
[2018-11-08T19:10:12,506][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2571] overhead, spent [274ms] collecting in the last [1s]
[2018-11-08T19:10:42,581][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2601] overhead, spent [287ms] collecting in the last [1s]
[2018-11-08T19:10:52,584][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2611] overhead, spent [265ms] collecting in the last [1s]
[2018-11-08T19:11:02,587][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2621] overhead, spent [260ms] collecting in the last [1s]
[2018-11-08T19:11:22,593][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2641] overhead, spent [250ms] collecting in the last [1s]
[2018-11-08T19:11:32,596][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2651] overhead, spent [259ms] collecting in the last [1s]
[2018-11-08T19:11:42,607][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2661] overhead, spent [331ms] collecting in the last [1s]
[2018-11-08T19:11:52,614][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2671] overhead, spent [266ms] collecting in the last [1s]
[2018-11-08T19:12:02,617][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2681] overhead, spent [271ms] collecting in the last [1s]
[2018-11-08T19:12:12,619][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2691] overhead, spent [267ms] collecting in the last [1s]
[2018-11-08T19:12:32,624][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2711] overhead, spent [270ms] collecting in the last [1s]
[2018-11-08T19:13:02,689][INFO ][o.e.m.j.JvmGcMonitorService] [masterela2] [gc][2741] overhead, spent [272ms] collecting in the last [1s]

Found the problem. The heap usage was caused by X-Pack monitoring. It seems my dedicated master nodes need more heap if I want to enable X-Pack monitoring.

This command solved the problem:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": false
  }
}
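To confirm the setting took effect, the persistent cluster settings can be read back with:

GET _cluster/settings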