ES 2.4 to 5.2 Upgrade Followed By Major Cluster Instability

Hi,

We recently upgraded our cluster from ES 2.4 to ES 5.2. The upgrade itself went smoothly, but upon starting everything back up we noticed a lot of instability, particularly with garbage collection. What used to be a very stable and reliable cluster is now dropping nodes due to heap and CPU spikes. Usage, indexes, docs, shards, and hardware are all the same.

We are currently trying to get to the bottom of why we are suddenly so unstable and why our heap usage seems out of control, but are not having much luck. Some details that might be useful:

  • Avg Indexing rate 1k docs/sec (can get to 3k/s)
  • refresh interval set to 30s to accommodate indexing (per-index setting; see the snippet after this list)
  • Some heavy aggregations but nothing that has caused problems in the past
  • java version "1.8.0_31"
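
For reference, the refresh interval above is the per-index refresh_interval setting; it can be applied with something like the following (our main index is used as the example, adjust the host and index name as needed):

  curl -XPUT 'localhost:9200/our_cool_index/_settings' -H 'Content-Type: application/json' -d '
  {
    "index": { "refresh_interval": "30s" }
  }'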

Was there a change in ES 5 that could cause heavy aggregations to suddenly become more of a burden? Or a change that could cause large amounts of indexing to produce heap and CPU spikes?

Any suggestions at all would be greatly appreciated! Let me know if more information is needed!

Thanks!

I am also part of the team working on this upgrade. A few additional notes that might be helpful:

  • We have fairly large documents, with 25-30 fields in some of them.
  • We have most of our data on a single large index, as opposed to log-style data where a new time-based (daily, etc) index is created.
  • We very heavily utilize parent-child relationships in ES, with about a 250-1 ratio between children and parents.
  • We have around 15 data nodes in the cluster, with 80 shards on the large index. Some time ago, we were told that we should reduce our shard count for better performance, but when we dropped down to 15 shards, we had markedly worse performance on ES 2.4, so we went back up to 80.

Are there any changes in ES 5 that might be particularly challenging for a dataset like ours? Any advice would be helpful.

Thanks!


Am I understanding you correctly that nodes are dropping due to out of memory in the Java heap space? Can you share a stacktrace?

@jasontedor I am on this team and I grabbed some logs from one of our data nodes that dropped out of the cluster. It looks like it was GCing hard, lost contact with the master, and timed out. We are also trying to do a rolling ES restart right now to add in some X-Pack Monitoring, and we have our first data node stuck initializing its shards for about an hour now.

[2017-03-24T17:52:02,558][WARN ][o.e.i.c.IndicesClusterStateService] [node-06] [[our_cool_index][76]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.2.jar:5.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_31]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [our_cool_index][76]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.2.jar:5.2.2]
        ... 15 more
[2017-03-24T17:52:02,639][WARN ][o.e.t.TransportService   ] [node-06] Received response for a request that has timed out, sent [64051ms] ago, timed out [21574ms] ago, action [internal:discovery/zen/fd/master_ping], node [{site-es-master-02}{D0LIYxy-RBqbrhEbKOP6WA}{_14q7sDQSvSFYx4tJSkWaQ}{10.0.11.165}{10.0.11.165:9300}{aws_availability_zone=us-west-2b}], id [4061802]
[2017-03-24T17:52:07,615][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-06] [our_cool_index][76]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [our_cool_index][76]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData$$Lambda$1506/1600564609.lock(Unknown Source) ~[?:?]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.2.jar:5.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_31]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]

As jdoss stated above, the nodes are spending almost every second garbage collecting, and every couple of minutes a collection will take 10-20s; that is when we see them drop requests and/or drop off the cluster.

Okay, so not running out of heap, just GCing hard. Have you taken a heap dump to understand why the heap is under so much pressure? You mentioned heavy aggregations; that could indeed be a cause, but let's check, you know, for science. What type of aggregations are you running? An earlier post mentioned 80 shards for an index, which is likely excessive. How large is each shard? Can you please share your JVM settings (jps -l -m -v would be my preference so we're sure we see the flags on the actual running JVM)?
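
Something along these lines would give me both the flags and the shard sizes (adjust the host and index name to yours):

  # JVM flags on the running process
  jps -l -m -v

  # on-disk size of each shard of the big index
  curl -XGET 'localhost:9200/_cat/shards/our_cool_index?v&h=index,shard,prirep,store,node'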

We started looking at the heap because it was growing so quickly that it caused the excessive GCing.
Here are the JVM settings we use.

11458 org.elasticsearch.bootstrap.Elasticsearch -d -p /var/run/elasticsearch/elasticsearch.pid -Edefault.path.logs=/elasticsearch/logs -Edefault.path.data=/elasticsearch/data -Edefault.path.conf=/elasticsearch/etc -Xms30193m -Xmx30193m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/local/elasticsearch-5.2.2

Here is a list of instance counts from a heap dump we took yesterday. It was taken immediately after a GC, so I don't know if it is as helpful as it would have been had we been able to grab it during a spike. None of the team is familiar with what exactly the heap should look like, so any insight into that would be helpful.
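
For anyone following along, a dump like that can be grabbed with jmap (the pid is the one from the jps output above); note that, as far as we understand, the live option forces a full GC before dumping, so a dump without it may be more useful during a spike:

  # dump only live (reachable) objects; this triggers a full GC first
  jmap -dump:live,format=b,file=/tmp/es-heap.hprof 11458

  # dropping "live" skips the forced GC and captures the heap as-is
  jmap -dump:format=b,file=/tmp/es-heap-full.hprof 11458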

Thanks for the quick responses!

I think it also might be important to note that our main index, which is at 80 shards, has about 15GB per shard right now. We have thought about lowering the shard count, so we tried it on a smaller client index, but we saw a big dip in search performance (2-3 times slower with half the shards). From what we have read, a good limit for shard size is 32GB, and given that we want room to grow, 80 seems like an OK number for our main index. We have split off some client data into their own indexes, so we do have multiple indexes; our total shard count across all of them is 280 primaries, each with a single replica, but we only ever search a single index at a time the way our data is set up.

Once again, this cluster was solid as a rock before the upgrade.

You're missing a crucial JVM setting. Can you please add:

-Dio.netty.recycler.maxCapacityPerThread=0

and restart your nodes and let me know if it helps? If not, I have more to say but let's start there.
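
If you're on the stock packaging, that property can go in config/jvm.options, one flag per line, alongside the rest of your JVM flags; otherwise add it wherever you currently set your flags:

  # config/jvm.options
  -Dio.netty.recycler.maxCapacityPerThread=0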

We adjusted the recycler setting and did a full cluster restart, and so far it seems a little better, but not much. We are still seeing heap spikes followed by 10+ second GC times, which are causing nodes to drop requests a lot more often than we would like.

What other thoughts do you have?

Okay, that's unfortunate and I'm sorry that didn't alleviate your troubles. The Netty recycler has caused a lot of pain in 5.x, so much so that we now ship with it disabled by default (since #22452), and I was hoping we'd get off easy with that being the solution here.

Time to roll up our sleeves. You mentioned heap spikes. Can you correlate these with any activity on the cluster? For these 10s pauses, let's check some obvious things: is swap disabled? How much RAM do you have? What are your CPU specs? Can you tell me what aggregations you're running? What is your request pattern (do you have dedicated query nodes, or do you query through the data nodes, and if so, do you round robin them)?

At some point I'd like to get a heap dump from you when the heap is under pressure (we can arrange a private transfer, but it won't do us any good right now as I'm traveling this weekend without a laptop, so we can work on that on Monday if we still haven't resolved this).

One puzzling thing to me is how many atomic booleans are in your histogram. I can't think where those would come from. They're not going to consume a ton of heap, it's just the sheer number that seems off. I'd like to trace their path to a GC root and see if they have a common source.
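
For the swap and memory-lock questions above, something like the following on each data node would be enough (the node info call shows whether mlockall took effect):

  # any swap configured / in use on the box?
  free -m
  swapon -s

  # did the JVM manage to lock its memory?
  curl -XGET 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'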

As for shard sizes, I'm not sure where 32GB came from; we often say 50GB, but of course these things can vary.

We have DM'ed you all the information you requested. Given that we are a security company we would like to keep some of it private. Once we have reached a solution we will gladly make sure this thread is updated to help others.

Thanks!

This is fine. I have no problem with this at all.

Hey there,

are you using random_score in your queries? We had similar problems after upgrading our cluster to 5.x and had to give up on random sampling; our parent-child ratio is around 1:10000, and loading all the _parent ids onto the heap was killing the nodes.

We had to move to a pre-computed "random score" field which we have to update regularly on our 5 billion documents... at least we have a nice SSD write endurance test :slight_smile:
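
For anyone unfamiliar, the random sampling we gave up on was the usual function_score/random_score combination, roughly like this (index name and seed are just placeholders):

  curl -XGET 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d '
  {
    "query": {
      "function_score": {
        "query": { "match_all": {} },
        "random_score": { "seed": 42 }
      }
    }
  }'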

Hey!

We do not use random_score. We actually don't do any scoring and just stick to the filter context for all of our queries. Our parent/child ratio is about 1:200 on average, and we do often aggregate on the _parent field in some of our aggregations; not sure if that might be an issue.

It is interesting you mention a large number of IDs for random score. We recently denormalized some data into its own index (250k unique chunks of info that were nested across all 855 million child docs) and often have to look up our child documents using a large number of IDs from the denormalized data; usually a few thousand IDs, but sometimes more. On 2.x it was a piece of cake, but maybe with 5 that's hurting us.
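
Roughly speaking, those lookups boil down to a big terms filter against the child documents, something like this (the type and field names here are just placeholders; the real queries carry a few thousand IDs):

  curl -XGET 'localhost:9200/our_cool_index/child_type/_search' -H 'Content-Type: application/json' -d '
  {
    "query": {
      "bool": {
        "filter": {
          "terms": { "denormalized_id": ["id-001", "id-002", "id-003"] }
        }
      }
    }
  }'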

Hm, you could try reducing the query cache size. Values above 4% were never stable for us; we are sitting at 3% at the moment.
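
That's the node-level indices.queries.cache.size setting; it's static, so it goes in elasticsearch.yml and needs a node restart, e.g.:

  # elasticsearch.yml
  indices.queries.cache.size: 3%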

Thanks for the suggestion, Andre! We tried it out by decreasing our cache from 10% to 5% and didn't see much of a difference. The heap still spikes, followed by 10+ second GC times. We also tried turning the cache off completely on one index to see if that would help, and it did not. Once again, we had none of these problems on 2.4; we were solid as a rock.


Just for the record, we upgraded Java from 1.8.0_31, as mentioned in @mstruve's original post, to the recommended 1.8.0_121 on all nodes. We saw no change in our issues, but we figured we would rule out any problems from not running the officially recommended version.

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

I am experiencing exactly the same thing!
Also trying to upgrade from ES 2.3 to 5.2.2 (the first try was to 5.2, but we hit this issue, so we moved to 5.2.2, which includes ES bug fixes and uses Lucene 6.4.1, which has a fix for a memory leak).

My cluster has 3 master nodes and 5 data nodes.
Data nodes are 32GB servers running Elasticsearch with a 15GB heap.
We've been running a 2.3 cluster for a long time and it's rock solid! The app and usage are exactly the same, and these issues occurred only after the upgrade to 5.2.2.

Still couldn't find a solution... monitoring-wise, everything seems very normal. Everything runs smoothly for a few hours after deployment, then all of a sudden the heap quickly jumps to max on all data nodes and the cluster crashes.

Would be happy to share more information about our app and usage; it might help us both map the problem and reach a solution.

@itaydvir do you have parent/child relationships in your mappings? We were told there is a bug in the query caching for those.

We were also told it might help to turn off the query caching on each index and clear the caches, because there is a Lucene bug that can cause memory leaks. We did that this morning and were good for a few hours, then exactly what you described happened to us. All the nodes spiked to 100% heap and we couldn't get them to respond to any request. Currently we are shutting down the whole cluster to try to recover. It's not good. :grimacing:
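
For anyone else trying this: clearing just the query cache can be done with the clear-cache API, while the per-index index.queries.cache.enabled setting is static, so it can only be changed at index creation or on a closed index. Roughly what that looks like (index name is ours):

  # clear only the query cache on an index
  curl -XPOST 'localhost:9200/our_cool_index/_cache/clear?query=true'

  # disabling the query cache requires the index to be closed first
  curl -XPOST 'localhost:9200/our_cool_index/_close'
  curl -XPUT 'localhost:9200/our_cool_index/_settings' -H 'Content-Type: application/json' -d '
  { "index.queries.cache.enabled": false }'
  curl -XPOST 'localhost:9200/our_cool_index/_open'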

We grabbed a heap dump during the spike, which hopefully some ES devs can look at to maybe find the problem. We will keep you posted as we find out more. Currently we are building a new 2.x cluster in preparation for rolling back.