Mysterious sudden load increase

(maf) #1


We are running an Elasticsearch cluster with 13 nodes (10 data, 3 master) on
Amazon hi1.4xlarge machines. The cluster contains almost 10 TB of data
(including one replica). It is running Elasticsearch 1.1.1 on Oracle Java
1.7.0_25.

Our problem is that every now and then the CPU load suddenly increases on
one of the data nodes. The load average can suddenly jump from about 4 up
to 10-16, and once it has jumped up it stays there. Then after a couple of
days another node is also affected and so on. Eventually most nodes in the
cluster are affected and we have to restart them. A restart of the Java
process brings the load back to normal.

We are not experiencing any abnormal levels of garbage collection on the
affected nodes.

I did a Java stack dump on one of the affected nodes, and one thing that
stood out was that it had a number of threads in state IN_JAVA; the
non-loaded nodes had no such threads. The stack dump for these threads
invariably looks something like this:

Thread 23022: (state = IN_JAVA)

  • java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled
    frame; information may be imprecise)
  • java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled
org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame),
org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)$1.scorer(org.apache.lucene.index.AtomicReaderContext,
boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled

    @bci=68, line=618 (Compiled frame),,
@bci=225, line=173 (Compiled frame), @bci=11, line=309 (Interpreted frame)
@bci=54, line=52 (Interpreted frame)
@bci=174, line=119 (Compiled frame)
@bci=49, line=233 (Interpreted frame)$SearchScanScrollTransportHandler.messageReceived(,
org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted
org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted

@bci=12, line=270 (Compiled frame)

@bci=95, line=1145 (Compiled frame)

  • java.util.concurrent.ThreadPoolExecutor$ @bci=5, line=615
    (Interpreted frame)
  • @bci=11, line=724 (Interpreted frame)
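
To compare a loaded node with a healthy one, it may help to count threads per state and tally the top frame of each IN_JAVA thread. Below is a minimal Python sketch, assuming the dump was saved to a text file and that thread headers look like the `Thread 23022: (state = IN_JAVA)` line above. The function name `summarize` and the file argument are hypothetical, and other dump tools may print a different header format.

```python
import re
import sys
from collections import Counter

# Header format assumed from the excerpt above,
# e.g. "Thread 23022: (state = IN_JAVA)".
HEADER = re.compile(r"^Thread (\d+): \(state = (\w+)\)")


def summarize(dump_text):
    """Return (threads per state, top frames of IN_JAVA threads)."""
    states = Counter()
    top_frames = Counter()
    awaiting_frame = False
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            state = match.group(2)
            states[state] += 1
            # Only the first frame after an IN_JAVA header is recorded.
            awaiting_frame = (state == "IN_JAVA")
            continue
        if awaiting_frame and line:
            # Drop a leading list marker and keep the method name before "(".
            frame = line.lstrip("-• ").split("(", 1)[0]
            top_frames[frame] += 1
            awaiting_frame = False
    return states, top_frames


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize_dump.py dump.txt
    with open(sys.argv[1], encoding="utf-8") as handle:
        states, top_frames = summarize(handle.read())
    print("threads per state:", dict(states))
    print("top frames of IN_JAVA threads:", dict(top_frames))
```

Running it against dumps from an affected and an unaffected node would show whether the IN_JAVA threads all sit in the same method.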

Does anybody know what we are experiencing, or have any tips on how to
further debug this?



(system) #2