Elasticsearch upgrade to 5.2 causes severely degraded performance

Our cluster performed nearly perfectly on 2.4. After upgrading to 5.2 we are seeing time spent in elastic increase by 6x, which basically makes our service useless because it is too slow. The only thing in the logs is a bunch of lines like these:

[2017-02-22T12:23:27,199][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][32939] overhead, spent [273ms] collecting in the last [1s]
[2017-02-22T12:23:47,273][WARN ][o.e.m.j.JvmGcMonitorService] [search1] [gc][32959] overhead, spent [600ms] collecting in the last [1s]

which indicates a lot of time spent garbage collecting. That doesn't give much of a clue on its own, and I've gone through the docs many times searching for answers without coming up with any. I've confirmed the right amount of heap is being allocated (31GB on each 64GB machine) across the 8 nodes. Going through the forum, this seems to be a common complaint, but not many people seem to be getting anywhere. Is this a known issue, or is there an obvious clue here that I am missing? Any help is appreciated. Thank you.

Also, I want to point out that all nodes are running OpenJDK version "1.8.0_111". I know the docs say 1.8.0_73 is recommended, but I'm not sure this matters since it's the same major version.
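For reference, this is roughly how I verified the heap size and JVM version on each node (a quick Python sketch against the nodes info API; the host name is a placeholder for one of ours):

import requests

ES = "http://search1:9200"  # placeholder host

# The nodes info API's jvm section reports the JVM version and configured heap per node.
resp = requests.get(ES + "/_nodes/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    jvm = node["jvm"]
    heap_gb = jvm["mem"]["heap_max_in_bytes"] / 1024.0 ** 3
    print("%s: jvm=%s heap_max=%.1fGB compressed_oops=%s" % (
        node["name"],
        jvm["version"],
        heap_gb,
        jvm.get("using_compressed_ordinary_object_pointers", "unknown"),
    ))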


Does hot_threads show something interesting?
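If it helps, something along these lines will dump the hot threads from every node so you can see where the search time is going (just a sketch; the host is a placeholder, and threads/interval are tweakable):

import requests

# The nodes hot_threads API returns plain text describing the busiest threads on each node.
resp = requests.get(
    "http://search1:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms"},
)
print(resp.text)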

When we first tested this we had a shard count of about 4500, but we have since lowered it to about 1000 by getting rid of some monthly indices and turning them into yearly indices (ours is a time series use case). This reduced the overall shard search load and we saw an improvement in time spent in elastic, but we are still having the same problem: the cluster just doesn't hold up in production at scale.
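In case anyone is curious how one might do that kind of consolidation, the reindex API can copy the monthlies into a yearly index. A rough sketch (the index names are made up, the host is a placeholder, and wait_for_completion=false just returns a task to poll instead of blocking):

import requests

ES = "http://search1:9200"  # placeholder host

# Copy a year's worth of monthly indices into a single yearly index via the reindex API.
monthly = ["metrics-2016.%02d" % m for m in range(1, 13)]  # made-up index names
body = {
    "source": {"index": monthly},
    "dest": {"index": "metrics-2016"},
}
resp = requests.post(
    ES + "/_reindex",
    params={"wait_for_completion": "false"},
    json=body,
)
print(resp.json())  # contains the task id to track via the tasks API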

The logs indicate the search queue is reaching capacity, which makes sense because the 5.2 cluster is spending between 5 and 10ms in elastic where the 2.4 cluster was only spending about 3.5ms, and our app makes heavy use of the search queue.
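For the record, the queue and rejection counts can be watched with the _cat thread pool API; a small sketch (the host is a placeholder):

import requests

# _cat/thread_pool/search reports active threads, queued requests and rejections per node.
resp = requests.get(
    "http://search1:9200/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
)
print(resp.text)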

CPU on the nodes hovers between 20 and 25%; on the 2.4 cluster the CPU is slightly lower but tends to reach the same levels.
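For anyone who wants to compare, per-node CPU, load and heap usage are easy to eyeball with the _cat nodes API (a quick sketch; the host is a placeholder):

import requests

# _cat/nodes can report per-node CPU, load average and heap usage in a compact table.
resp = requests.get(
    "http://search1:9200/_cat/nodes",
    params={"v": "true", "h": "name,cpu,load_1m,heap.percent"},
)
print(resp.text)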

The logs display lines like these

[2017-02-28T03:00:46,029][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][470636] overhead, spent [253ms] collecting in the last [1s]

which I understand is an indicator of long garbage collection cycles.
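To put numbers on that, the nodes stats API exposes heap usage and per-collector GC counts and times; something like this (a sketch, host is a placeholder) prints them for each node:

import requests

# The nodes stats API's jvm section exposes heap usage and GC collector statistics.
resp = requests.get("http://search1:9200/_nodes/stats/jvm")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    jvm = node["jvm"]
    gc = jvm["gc"]["collectors"]
    print("%s heap_used=%d%% young_gc=%dms/%d old_gc=%dms/%d" % (
        node["name"],
        jvm["mem"]["heap_used_percent"],
        gc["young"]["collection_time_in_millis"], gc["young"]["collection_count"],
        gc["old"]["collection_time_in_millis"], gc["old"]["collection_count"],
    ))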

The hot threads output shows a lot of this when running in production:

26.7% (133.4ms out of 500ms) cpu usage by thread 'elasticsearch[search7][search][T#6]'
 2/10 snapshots sharing following 28 elements
   org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:483)
   org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum(TermQuery.java:132)
   org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:100)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
   org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155)
   org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
   org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
   org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
   org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:775)
   org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:161)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:666)
   org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:473)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:379)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:106)
   org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:246)
   org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:260)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:298)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:295)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
   org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

It was my understanding that ES 5 performs much better under search load. What I don't understand is why upgrading results in more time spent in elastic and an overall unstable cluster?
