Experiencing very high CPU usage on 1.0.0 RC2 - preventing _nodes/stats from returning in a sensible time

Folks,

We recently upgraded our cluster to 1.0.0 RC2, and now head, kopf and other
plugins seem to be struggling to display cluster stats. Digging down, it
looks like the _nodes/stats request is timing out. Looking further, there is
very high CPU usage on some, but not all, nodes. A hot_threads call against
the hot nodes returns the output below (a sketch of how we are polling these
endpoints follows the trace). I have seen commits on GitHub and comments in
"ElasticSearch this week" saying that RamUsageEstimator has been identified
as a performance issue. Are there any workarounds, or do we need to wait for
RC3 :frowning: ?

Thanks in advance!

103.0% (514.8ms out of 500ms) cpu usage by thread 'elasticsearch[*************************][management][T#4]'
  10/10 snapshots sharing following 21 elements
    java.lang.reflect.Array.get(Native Method)
    org.apache.lucene.util.RamUsageEstimator.measureObjectSize(RamUsageEstimator.java:456)
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:211)
    org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
    org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
    org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
    org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:281)
    org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:272)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)

103.0% (514.8ms out of 500ms) cpu usage by thread 'elasticsearch[*************************][management][T#3]'
  10/10 snapshots sharing following 19 elements
    java.lang.reflect.Array.get(Native Method)
    org.apache.lucene.util.RamUsageEstimator.measureObjectSize(RamUsageEstimator.java:456)
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:197)
    org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:53)
    org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:413)
    org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:399)
    org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    java.lang.Thread.run(Unknown Source)

99.8% (499.2ms out of 500ms) cpu usage by thread 'elasticsearch[*************************][marvel.exporters]'
  3/10 snapshots sharing following 14 elements
    java.lang.reflect.Array.get(Native Method)
    org.apache.lucene.util.RamUsageEstimator.measureObjectSize(RamUsageEstimator.java:456)
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportShardStats(AgentService.java:252)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:175)
    java.lang.Thread.run(Unknown Source)
  2/10 snapshots sharing following 14 elements
    org.apache.lucene.util.RamUsageEstimator$IdentityHashSet.contains(RamUsageEstimator.java:705)
    org.apache.lucene.util.RamUsageEstimator.measureObjectSize(RamUsageEstimator.java:478)
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportShardStats(AgentService.java:252)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:175)
    java.lang.Thread.run(Unknown Source)
  2/10 snapshots sharing following 13 elements
    org.apache.lucene.util.RamUsageEstimator.measureObjectSize(RamUsageEstimator.java:489)
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportShardStats(AgentService.java:252)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:175)
    java.lang.Thread.run(Unknown Source)
  3/10 snapshots sharing following 12 elements
    org.apache.lucene.util.RamUsageEstimator.sizeOf(RamUsageEstimator.java:350)
    org.apache.lucene.codecs.lucene3x.Lucene3xFields.ramBytesUsed(Lucene3xFields.java:1080)
    org.apache.lucene.index.SegmentCoreReaders.ramBytesUsed(SegmentCoreReaders.java:195)
    org.apache.lucene.index.SegmentReader.ramBytesUsed(SegmentReader.java:558)
    org.elasticsearch.index.engine.internal.InternalEngine.getReaderRamBytesUsed(InternalEngine.java:1124)
    org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1136)
    org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:532)
    org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:161)
    org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportShardStats(AgentService.java:252)
    org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:175)
    java.lang.Thread.run(Unknown Source)
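
For reference, this is roughly how we are polling the two endpoints while
debugging. It is only a minimal sketch in Python using requests; the host,
port and timeout values are placeholders for our setup:

    # Minimal sketch of the diagnostic calls (host and timeouts are placeholders).
    import requests

    ES = "http://localhost:9200"  # one of the hot nodes

    # Node stats: this is the call that is timing out for us.
    try:
        resp = requests.get(ES + "/_nodes/stats", timeout=10)
        print(resp.status_code, len(resp.content), "bytes")
    except requests.exceptions.Timeout:
        print("_nodes/stats did not return within 10s")

    # Hot threads: this is where the RamUsageEstimator frames show up.
    hot = requests.get(ES + "/_nodes/hot_threads", timeout=30)
    print(hot.text)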


An interesting fact is that your stack traces point to a Lucene 3.x field RAM
size estimator method, which may have some as-yet-unidentified issues. This
suggests to me that you still have an index originating from an early ES
version (<0.90?).

If possible, I recommend reindexing your data with 1.0.0.RC2, so that you
benefit from a freshly built Lucene 4.6.1 index.

Jörg
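
Since 1.0 has no built-in reindex API, reindexing here means a scan/scroll
over the old index plus bulk requests into a new one. A rough sketch of that
loop in Python (index names, host, page size and scroll timeout are
placeholders, and error handling is omitted for brevity):

    # Reindex sketch for ES 1.0: scan/scroll the old index, bulk-index into a new one.
    import json
    import requests

    ES = "http://localhost:9200"
    SRC, DEST = "old-index", "new-index"  # placeholders

    # Start a scan-type search so we can page through the whole source index.
    r = requests.post(ES + "/" + SRC + "/_search",
                      params={"search_type": "scan", "scroll": "5m", "size": 500},
                      data=json.dumps({"query": {"match_all": {}}}))
    scroll_id = r.json()["_scroll_id"]

    while True:
        # Each scroll call returns the next page of hits (empty when done).
        r = requests.post(ES + "/_search/scroll", params={"scroll": "5m"}, data=scroll_id)
        body = r.json()
        hits = body["hits"]["hits"]
        if not hits:
            break
        scroll_id = body["_scroll_id"]

        # Bulk-index the page into the new index, preserving type and id.
        lines = []
        for h in hits:
            lines.append(json.dumps({"index": {"_index": DEST,
                                               "_type": h["_type"],
                                               "_id": h["_id"]}}))
            lines.append(json.dumps(h["_source"]))
        requests.post(ES + "/_bulk", data="\n".join(lines) + "\n")

Once the new index is populated and verified, you could point an alias at it
so clients do not need to change index names.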


Hey,

maybe it is possible, as a quick hack, to exclude the segment statistics (if
you do not need them) so that you do not run into that performance problem...

--Alex
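
For example, something along these lines should avoid computing segment stats
at all, assuming the 1.0 indices stats API accepts a metric list in the URL
path (worth double-checking against the docs for your exact version; the
metric names below are just examples):

    # Sketch: request only the stats metrics you need, leaving "segments" out of
    # the list so the RamUsageEstimator code path is never hit.
    import requests

    ES = "http://localhost:9200"  # placeholder

    r = requests.get(ES + "/_stats/docs,store,indexing,search", timeout=10)
    print(list(r.json()["_all"]["total"].keys()))

Of course this only helps for monitoring calls you control yourself; plugins
like head/kopf and Marvel make their own stats requests.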


