Elasticsearch upgrade to 5.2 causes severely degraded performance

Our cluster performed nearly perfectly on 2.4. After upgrading to 5.2 we are seeing time spent in elastic increase by 6x, which basically makes our service useless because it is too slow. The only thing in the logs is a bunch of lines like these:

[2017-02-22T12:23:27,199][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][32939] overhead, spent [273ms] collecting in the last [1s]
[2017-02-22T12:23:47,273][WARN ][o.e.m.j.JvmGcMonitorService] [search1] [gc][32959] overhead, spent [600ms] collecting in the last [1s]

which indicates a lot of time spent garbage collecting. That doesn't give much of a clue on its own, and I've gone through the docs many times searching for answers without coming up with any. I've confirmed the right amount of heap is being allocated (31GB on each 64GB machine) across the 8 nodes. Going through the forum, this seems to be a common complaint, but not many people seem to be getting anywhere. Is this a known issue, or is there an obvious clue here that I am missing? Any help is appreciated. Thank you.

Also, I want to point out that all nodes are running OpenJDK version "1.8.0_111". I know the docs say 1.8.0_73 is recommended, but I'm not sure this matters since it's the same major version.
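For reference, this is roughly how I verified the heap size and JVM version on each node (a quick Python sketch against the nodes info API; the host name is a placeholder for one of ours):

import requests

ES = "http://search1:9200"  # placeholder host

# The nodes info API's jvm section reports the JVM version and configured heap per node.
resp = requests.get(ES + "/_nodes/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024.0 ** 3
    print("%s: jvm=%s heap_max=%.1fGB compressed_oops=%s" % (
        node["name"],
        jvm["version"],
        heap_gb,
        jvm.get("using_compressed_ordinary_object_pointers", "unknown"),
    ))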


Does hot_threads show something interesting?
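If it helps, something along these lines will dump the hot threads from every node so you can see where the search time is going (just a sketch; the host is a placeholder, and threads/interval are tweakable):

import requests

# The nodes hot_threads API returns plain text describing the busiest threads on each node.
resp = requests.get(
    "http://search1:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms"},
)
print(resp.text)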

When we first tested this we had a shard count of about 4500, but we have since lowered it to about 1000 by getting rid of some monthly indices and turning them into yearly indices (ours is a time series use case). This reduced the overall shard search load and we saw an improvement in time spent in elastic, but we are still having the same problem: the cluster just doesn't hold up in production at scale.
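In case anyone is curious how one might do that kind of consolidation, the reindex API can copy the monthlies into a yearly index. A rough sketch (the index names are made up, the host is a placeholder, and wait_for_completion=false just returns a task to poll instead of blocking):

import requests

ES = "http://search1:9200"  # placeholder host

# Copy a year's worth of monthly indices into a single yearly index via the reindex API.
monthly = ["metrics-2016.%02d" % m for m in range(1, 13)]  # made-up index names
body = {
    "source": {"index": monthly},
    "dest": {"index": "metrics-2016"},
}
resp = requests.post(
    ES + "/_reindex",
    params={"wait_for_completion": "false"},
    json=body,
)
print(resp.json())  # contains the task id to track via the tasks API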

The logs indicate the search queue is reaching capacity, which makes sense because the 5.2 cluster is spending between 5 and 10ms in elastic where the 2.4 cluster was only spending about 3.5ms, and our app makes heavy use of the search queue.
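For the record, the queue and rejection counts can be watched with the _cat thread pool API; a small sketch (the host is a placeholder):

import requests

# _cat/thread_pool/search reports active threads, queued requests and rejections per node.
resp = requests.get(
    "http://search1:9200/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
)
print(resp.text)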

CPU on the nodes hovers between 20 and 25%; on the 2.4 cluster the CPU is slightly lower but tends to reach the same levels.
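For anyone who wants to compare, per-node CPU, load and heap usage are easy to eyeball with the _cat nodes API (a quick sketch; the host is a placeholder):

import requests

# _cat/nodes can report per-node CPU, load average and heap usage in a compact table.
resp = requests.get(
    "http://search1:9200/_cat/nodes",
    params={"v": "true", "h": "name,cpu,load_1m,heap.percent"},
)
print(resp.text)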

The logs display lines like these

[2017-02-28T03:00:46,029][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][470636] overhead, spent [253ms] collecting in the last [1s]

which I understand is an indicator of long garbage collection cycles.
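To put numbers on that, the nodes stats API exposes heap usage and per-collector GC counts and times; something like this (a sketch, host is a placeholder) prints them for each node:

import requests

# The nodes stats API's jvm section exposes heap usage and GC collector statistics.
resp = requests.get("http://search1:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    jvm = node["jvm"]
    gc = jvm["gc"]["collectors"]
    print("%s heap_used=%d%% young_gc=%dms/%d old_gc=%dms/%d" % (
        node["name"],
        jvm["mem"]["heap_used_percent"],
        gc["young"]["collection_time_in_millis"], gc["young"]["collection_count"],
        gc["old"]["collection_time_in_millis"], gc["old"]["collection_count"],
    ))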

The hot threads output shows a lot of this when running in production:

26.7% (133.4ms out of 500ms) cpu usage by thread 'elasticsearch[search7][search][T#6]'
 2/10 snapshots sharing following 28 elements
   org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:483)
   org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum(TermQuery.java:132)
   org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:100)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
   org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
   org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
   org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
   org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:775)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:161)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:666)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:379)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
   org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:246)
   org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:260)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:298)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:295)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
   org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

It was my understanding that ES 5 performs much better under search load. What I don't understand is why upgrading results in more time spent in elastic and an overall unstable cluster?
