ES 2.4 to 5.2 Upgrade Followed By Major Cluster Instability

Hi,

We recently upgraded our cluster from ES 2.4 to ES 5.2. The upgrade itself went smoothly, but upon starting everything back up we noticed a lot of instability, particularly with garbage collection. What used to be a very stable and reliable cluster is now dropping nodes due to heap and CPU spikes. Usage, indexes, docs, shards, and hardware are all the same.

We are currently trying to get to the bottom of why we are suddenly so unstable and why our heap usage seems out of control, but are not having much luck. Some details that might be useful:

  • Avg Indexing rate 1k docs/sec (can get to 3k/s)
  • refresh interval set to 30s to accommodate indexing (per-index setting; see the snippet after this list)
  • Some heavy aggregations but nothing that has caused problems in the past
  • java version "1.8.0_31"
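
For reference, the refresh interval above is the per-index refresh_interval setting; it can be applied with something like the following (our main index is used as the example, adjust the host and index name as needed):

  curl -XPUT 'localhost:9200/our_cool_index/_settings' -H 'Content-Type: application/json' -d '
  {
    "index": { "refresh_interval": "30s" }
  }'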

Was there a change in ES 5 that could cause heavy aggregations to suddenly become more of a burden? Or a change that could cause large amounts of indexing to produce heap and CPU spikes?

Any suggestions at all would be greatly appreciated! Let me know if more information is needed!

Thanks!

I am also part of the team working on this upgrade. A few additional notes that might be helpful:

  • We have fairly large documents, with 25-30 fields in some of them.
  • We have most of our data on a single large index, as opposed to log-style data where a new time-based (daily, etc) index is created.
  • We very heavily utilize parent-child relationships in ES, with about a 250-1 ratio between children and parents.
  • We have around 15 data nodes in the cluster, with 80 shards on the large index. Some time ago, we were told that we should reduce our shard count for better performance, but when we dropped down to 15 shards, we had markedly worse performance on ES 2.4, so we went back up to 80.

Are there any changes in ES 5 that might be particularly challenging for a dataset like ours? Any advice would be helpful.

Thanks!


Am I understanding you correctly that nodes are dropping due to out of memory in the Java heap space? Can you share a stacktrace?

@jasontedor I am on this team and I grabbed some logs from one of our data nodes that dropped out of the cluster. It looks like it was GCing hard, lost contact with the master, and timed out. We are also trying to do a rolling ES restart right now to add in some X-Pack Monitoring, and we have our first data node stuck initializing its shards for about an hour now.

[2017-03-24T17:52:02,558][WARN ][o.e.i.c.IndicesClusterStateService] [node-06] [[our_cool_index][76]] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:367) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:476) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:146) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:542) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:519) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:204) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.callClusterStateAppliers(ClusterService.java:856) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.publishAndApplyChanges(ClusterService.java:810) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:628) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1112) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:527) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.2.2.jar:5.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_31]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [our_cool_index][76]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:297) ~[elasticsearch-5.2.2.jar:5.2.2]
        ... 15 more
[2017-03-24T17:52:02,639][WARN ][o.e.t.TransportService   ] [node-06] Received response for a request that has timed out, sent [64051ms] ago, timed out [21574ms] ago, action [internal:discovery/zen/fd/master_ping], node [{site-es-master-02}{D0LIYxy-RBqbrhEbKOP6WA}{_14q7sDQSvSFYx4tJSkWaQ}{10.0.11.165}{10.0.11.165:9300}{aws_availability_zone=us-west-2b}], id [4061802]
[2017-03-24T17:52:07,615][INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [node-06] [our_cool_index][76]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [our_cool_index][76]: obtaining shard lock timed out after 5000ms
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:712) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:631) ~[elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData$$Lambda$1506/1600564609.lock(Unknown Source) ~[?:?]
        at org.elasticsearch.index.store.Store.readMetadataSnapshot(Store.java:383) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:153) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:112) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:64) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:145) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:270) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:266) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) [elasticsearch-5.2.2.jar:5.2.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.2.2.jar:5.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_31]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]

As jdoss stated above, the nodes are spending almost every second garbage collecting, and every couple of minutes a collection will take 10-20s; that is when we see them drop requests and/or drop off the cluster.

Okay, so not running out of heap, just GCing hard. Have you taken a heap dump to understand why the heap is under so much pressure? You mentioned heavy aggregations; that could indeed be a cause, but let's check, you know, for science. What type of aggregations are you running? An earlier post mentioned 80 shards for an index, which is likely excessive. How large is each shard? Can you please share your JVM settings (jps -l -m -v would be my preference so we're sure we see the flags on the actual running JVM)?
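
Something along these lines would give me both the flags and the shard sizes (adjust the host and index name to yours):

  # JVM flags on the running process
  jps -l -m -v

  # on-disk size of each shard of the big index
  curl -XGET 'localhost:9200/_cat/shards/our_cool_index?v&h=index,shard,prirep,store,node'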

We started looking at the heap because it was growing so quickly that it caused the excessive GCing.
Here are the JVM settings we use.

11458 org.elasticsearch.bootstrap.Elasticsearch -d -p /var/run/elasticsearch/elasticsearch.pid -Edefault.path.logs=/elasticsearch/logs -Edefault.path.data=/elasticsearch/data -Edefault.path.conf=/elasticsearch/etc -Xms30193m -Xmx30193m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/usr/local/elasticsearch-5.2.2

Here is a list of instance counts from a heap dump we took yesterday. It was taken immediately after a GC, so I don't know if it is as helpful as it would have been had we been able to grab it during a spike. None of the team is familiar with what exactly the heap should look like, so any insight into that would be helpful.
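
For anyone following along, a dump like that can be grabbed with jmap (the pid is the one from the jps output above); note that, as far as we understand, the live option forces a full GC before dumping, so a dump without it may be more useful during a spike:

  # dump only live (reachable) objects; this triggers a full GC first
  jmap -dump:live,format=b,file=/tmp/es-heap.hprof 11458

  # dropping "live" skips the forced GC and captures the heap as-is
  jmap -dump:format=b,file=/tmp/es-heap-full.hprof 11458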

Thanks for the quick responses!

I think it also might be important to note that our main index, which is at 80 shards, has about 15GB per shard right now. We have thought about lowering the shard count, so we tried it on a smaller client index, but we saw a big dip in search performance (2-3 times slower with half the shards). From what we have read, a good limit for shard size is 32GB, and given that we want room to grow, 80 seems like an OK number for our main index. We have split off some client data into their own indexes, so we do have multiple indexes; our total shard count across all of them is 280 primaries, each with a single replica, but we only ever search a single index at a time the way our data is set up.

Once again, this cluster was solid as a rock before the upgrade.

You're missing a crucial JVM setting. Can you please add:

-Dio.netty.recycler.maxCapacityPerThread=0

and restart your nodes and let me know if it helps? If not, I have more to say but let's start there.
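
If you're on the stock packaging, that property can go in config/jvm.options, one flag per line, alongside the rest of your JVM flags; otherwise add it wherever you currently set your flags:

  # config/jvm.options
  -Dio.netty.recycler.maxCapacityPerThread=0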

We adjusted the recycler setting and did a full cluster restart, and so far it seems a little better, but not much. We are still seeing heap spikes followed by 10+ second GC times, which are causing nodes to drop requests a lot more often than we would like.

What other thoughts do you have?

Okay, that's unfortunate and I'm sorry that didn't alleviate your troubles. The Netty recycler has caused a lot of pain in 5.x, so much so that we now ship with it disabled by default (since #22452), and I was hoping we'd get off easy with that being the solution here.

Time to roll up our sleeves. You mentioned heap spikes. Can you correlate these with any activity on the cluster? For these 10s pauses, let's check some obvious things: is swap disabled? How much RAM do you have? What are your CPU specs? Can you tell me what aggregations you're running? What is your request pattern (do you have dedicated query nodes, or do you query through the data nodes, and if so, do you round robin them)?

At some point I'd like to get a heap dump from you when the heap is under pressure (we can arrange a private transfer, but it won't do us any good right now as I'm traveling this weekend without a laptop, so we can work on that on Monday if we still haven't resolved this).

One puzzling thing to me is how many atomic booleans are in your histogram. I can't think where those would come from. They're not going to consume a ton of heap, it's just the sheer number that seems off. I'd like to trace their path to a GC root and see if they have a common source.
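
For the swap and memory-lock questions above, something like the following on each data node would be enough (the node info call shows whether mlockall took effect):

  # any swap configured / in use on the box?
  free -m
  swapon -s

  # did the JVM manage to lock its memory?
  curl -XGET 'localhost:9200/_nodes?filter_path=**.mlockall&pretty'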

As for shard sizes, I'm not sure where 32GB came from; we often say 50GB, but of course these things can vary.

We have DM'ed you all the information you requested. Given that we are a security company we would like to keep some of it private. Once we have reached a solution we will gladly make sure this thread is updated to help others.

Thanks!

This is fine. I have no problem with this at all.

Hey there,

are you using random_score in your queries? We had similar problems after upgrading our cluster to 5.x and had to give up on random sampling; our parent-child ratio is around 1:10000, and loading all the _parent ids onto the heap was killing the nodes.

We had to move to a pre-computed "random score" field which we have to update regularly on our 5 billion documents... at least we have a nice SSD write endurance test :slight_smile:
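
For anyone unfamiliar, the random sampling we gave up on was the usual function_score/random_score combination, roughly like this (index name and seed are just placeholders):

  curl -XGET 'localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d '
  {
    "query": {
      "function_score": {
        "query": { "match_all": {} },
        "random_score": { "seed": 42 }
      }
    }
  }'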

Hey!

We do not use random_score. We actually don't do any scoring and just stick to the filter context for all of our queries. Our parent/child ratio is about 1:200 on average, and we do often aggregate on the _parent field in some of our aggregations; not sure if that might be an issue.

It is interesting you mention a large number of IDs for random score. We recently denormalized some data into its own index (250k unique chunks of info that were nested across all 855 million child docs) and often have to look up our child documents using a large number of IDs from the denormalized data; usually a few thousand IDs, but sometimes more. On 2.x it was a piece of cake, but maybe with 5 that's hurting us.
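
Roughly speaking, those lookups boil down to a big terms filter against the child documents, something like this (the type and field names here are just placeholders; the real queries carry a few thousand IDs):

  curl -XGET 'localhost:9200/our_cool_index/child_type/_search' -H 'Content-Type: application/json' -d '
  {
    "query": {
      "bool": {
        "filter": {
          "terms": { "denormalized_id": ["id-001", "id-002", "id-003"] }
        }
      }
    }
  }'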

Hm, you could try reducing the query cache size. Values above 4% were never stable for us; we are sitting at 3% at the moment.
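
That's the node-level indices.queries.cache.size setting; it's static, so it goes in elasticsearch.yml and needs a node restart, e.g.:

  # elasticsearch.yml
  indices.queries.cache.size: 3%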

Thanks for the suggestion, Andre! We tried it out by decreasing our cache from 10% to 5% and didn't see much of a difference. The heap still spikes, followed by 10+ second GC times. We also tried turning the cache off completely on one index to see if that would help, and it did not. Once again, we had none of these problems on 2.4; we were solid as a rock.


Just for the record, we upgraded Java from 1.8.0_31, as mentioned in @mstruve's original post, to the recommended 1.8.0_121 on all nodes. We saw no change in our issues, but we figured we would rule out any problems from not running the officially recommended version.

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

I am experiencing exactly the same thing!
Also trying to upgrade from ES 2.3 to 5.2.2 (the first try was to 5.2, but we hit this issue, so we moved to 5.2.2, which includes ES bug fixes and uses Lucene 6.4.1, which has a fix for a memory leak).

My cluster has 3 master nodes and 5 data nodes.
Data nodes are 32GB servers running Elasticsearch with a 15GB heap.
We've been running a 2.3 cluster for a long time and it's rock solid! The app and usage are exactly the same, and these issues occurred only after the upgrade to 5.2.2.

Still couldn't find a solution... monitoring-wise, everything seems very normal. Everything runs smoothly for a few hours after deployment, then all of a sudden the heap quickly jumps to max on all data nodes and the cluster crashes.

Would be happy to share more information about our app and usage; it might help us both map the problem and reach a solution.

@itaydvir do you have parent/child relationships in your mappings? We were told there is a bug in the query caching for those.

We were also told it might help to turn off the query caching on each index and clear the caches, because there is a Lucene bug that can cause memory leaks. We did that this morning and were good for a few hours, then exactly what you described happened to us. All the nodes spiked to 100% heap and we couldn't get them to respond to any request. Currently we are shutting down the whole cluster to try to recover. It's not good. :grimacing:
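
For anyone else trying this: clearing just the query cache can be done with the clear-cache API, while the per-index index.queries.cache.enabled setting is static, so it can only be changed at index creation or on a closed index. Roughly what that looks like (index name is ours):

  # clear only the query cache on an index
  curl -XPOST 'localhost:9200/our_cool_index/_cache/clear?query=true'

  # disabling the query cache requires the index to be closed first
  curl -XPOST 'localhost:9200/our_cool_index/_close'
  curl -XPUT 'localhost:9200/our_cool_index/_settings' -H 'Content-Type: application/json' -d '
  { "index.queries.cache.enabled": false }'
  curl -XPOST 'localhost:9200/our_cool_index/_open'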

We grabbed a heap dump during the spike, which hopefully some ES devs can look at to maybe find the problem. We will keep you posted as we find out more. Currently we are building a new 2.x cluster in preparation for rolling back.