One thread 100% CPU

We have 3 nodes, and this is what we had in hot threads:

Hot threads at 2017-03-23T12:04:29.200Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

97.4% (487.1ms out of 500ms) cpu usage by thread 'elasticsearch[pro-analytics3][management][T#2]'
2/10 snapshots sharing following 24 elements
sun.nio.fs.UnixPath.&lt;init&gt;(UnixPath.java:71)
sun.nio.fs.UnixFileSystem.getPath(UnixFileSystem.java:281)
sun.nio.fs.AbstractPath.resolve(AbstractPath.java:53)
org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:209)
org.apache.lucene.store.FileSwitchDirectory.fileLength(FileSwitchDirectory.java:150)
org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1543)
org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1532)
org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1519)
org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:55)
org.elasticsearch.index.store.Store.stats(Store.java:293)
org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:665)
org.elasticsearch.action.admin.indices.stats.CommonStats.&lt;init&gt;(CommonStats.java:134)
org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
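
A dump like the one above comes from the hot threads API; as a minimal sketch, assuming the node answers on localhost:9200 (the threads and interval parameters mirror the header above):

import urllib.request

# Ask the cluster for its busiest threads; parameters match the dump header.
url = "http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))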

In the logs we see:

[2017-03-23 12:58:29,756][WARN ][transport ] [pro-analytics3] Received response for a request that has timed out, sent [99355ms] ago, timed out [58057ms] ago, action [cluster:monitor/nodes/stats[n]], node [{pro-analytics3}{fkxzHRaSTUiHR99KbZKT8Q}

[2017-03-23 12:58:29,768][WARN ][monitor.jvm ] [pro-analytics3] [gc][old][386283][774] duration [57.2s], collections [1]/[58.2s], total [57.2s]/[2h], memory [14.1gb]->[14.1gb]/[14.3gb], all_pools {[young] [409.2mb]->[418.5mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [13.7gb]->[13.7gb]/[13.7gb]}

[2017-03-23 12:59:15,208][WARN ][monitor.jvm ] [pro-analytics3] [gc][old][386284][775] duration [44.6s], collections [1]/[45s], total [44.6s]/[2h], memory [14.1gb]->[14.1gb]/[14.3gb], all_pools {[young] [418.5mb]->[422.9mb]/[532.5mb]}{[survivor] [0b]->[0b]/[66.5mb]}{[old] [13.7gb]->[13.7gb]/[13.7gb]}

[2017-03-23 13:00:12,361][WARN ][discovery.zen.publish ] [pro-analytics3] timed out waiting for all nodes to process published state [18657] (timeout [30s], pending nodes: [{pro-analytics2}{pMCPbvEzSEqp7lwUD_rvKg}{box_type=hot}, {pro-analytics1}{jO9IIlQFT_qJI5gI7JTHHg}{box_type=hot}])

That is some very, very long GC you have there. Which version of Elasticsearch are you using? If you have monitoring installed, what does your heap usage look like?
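
If there is no monitoring plugin, a quick per-node heap snapshot can also be pulled from the cat nodes API; a minimal sketch, assuming the cluster answers on localhost:9200:

import urllib.request

# Print name, current heap percentage and max heap for every node.
url = "http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))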

"version": {
"number": "2.2.0",

We just have kopf, and yes, it is at about 98% right now... but normally it goes from 50% to 70%.

What you see in the logs is very long GC causing problems. If this is a recurring problem, I would recommend looking at what takes up your heap and trying to address that, or maybe even scaling out.

OK, thanks. Any recommendation on how to investigate what takes up the heap?

Look at the node stats API. Having a very large number of shards can also tie up resources. Also check that you have swap disabled and that memory is not over-committed if you are using VMs, as this can slow down GC.
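
For example, something like this rough sketch pulls node stats and cluster health and prints the usual heap consumers; the localhost host and the get() helper are assumptions, but the field names are the ones the 2.x node stats API exposes:

import json
import urllib.request

def get(path):
    # Assumed host; point this at one of your nodes.
    with urllib.request.urlopen("http://localhost:9200" + path) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Per-node heap usage plus the usual big heap consumers.
stats = get("/_nodes/stats/jvm,indices")
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    segments_mb = node["indices"]["segments"]["memory_in_bytes"] / 1024.0 / 1024.0
    fielddata_mb = node["indices"]["fielddata"]["memory_size_in_bytes"] / 1024.0 / 1024.0
    print("%s heap=%d%% segments=%.0fMB fielddata=%.0fMB"
          % (node["name"], heap_pct, segments_mb, fielddata_mb))

# A very large number of shards also ties up heap; cluster health has the total.
health = get("/_cluster/health")
print("active shards:", health["active_shards"])

If segment or fielddata memory accounts for most of the heap, that is the place to start; otherwise scaling out, as mentioned above, may be the next step.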

thanks

Do you know if there is a way to manually change the master node to another node?

No, there is no API for that. I am not sure how that would help either.

Not for solving the issue, but to pass the responsibility/work to another node so the old master node can have more resources.
