Hello,
We are running an Elasticsearch cluster with 13 nodes (10 data and 3 master)
on Amazon hi1.4xlarge machines. The cluster contains almost 10 TB of data
(including one replica). It is running Elasticsearch 1.1.1 on Oracle Java
1.7.0_25.
Our problem is that every now and then the CPU load suddenly increases on
one of the data nodes. The load average can jump from about 4 up to 10-16,
and once it has jumped it stays there. After a couple of days another node
is affected as well, and so on. Eventually most nodes in the cluster are
affected and we have to restart them. Restarting the Java process brings
the load back to normal.
We are not experiencing any abnormal levels of garbage collection on the
affected nodes.
I took a Java stack dump on one of the affected nodes, and one thing that
stood out was that it had a number of threads in state IN_JAVA; the
non-loaded nodes had no such threads. The stack dump for these threads
invariably looks something like this:
Thread 23022: (state = IN_JAVA)
 - java.util.HashMap.getEntry(java.lang.Object) @bci=72, line=446 (Compiled frame; information may be imprecise)
 - java.util.HashMap.get(java.lang.Object) @bci=11, line=405 (Compiled frame)
 - org.elasticsearch.search.scan.ScanContext$ScanFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=8, line=156 (Compiled frame)
 - org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(org.apache.lucene.index.AtomicReaderContext, org.apache.lucene.util.Bits) @bci=6, line=45 (Compiled frame)
 - org.apache.lucene.search.FilteredQuery$1.scorer(org.apache.lucene.index.AtomicReaderContext, boolean, boolean, org.apache.lucene.util.Bits) @bci=34, line=130 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=68, line=618 (Compiled frame)
 - org.elasticsearch.search.internal.ContextIndexSearcher.search(java.util.List, org.apache.lucene.search.Weight, org.apache.lucene.search.Collector) @bci=225, line=173 (Compiled frame)
 - org.apache.lucene.search.IndexSearcher.search(org.apache.lucene.search.Query, org.apache.lucene.search.Collector) @bci=11, line=309 (Interpreted frame)
 - org.elasticsearch.search.scan.ScanContext.execute(org.elasticsearch.search.internal.SearchContext) @bci=54, line=52 (Interpreted frame)
 - org.elasticsearch.search.query.QueryPhase.execute(org.elasticsearch.search.internal.SearchContext) @bci=174, line=119 (Compiled frame)
 - org.elasticsearch.search.SearchService.executeScan(org.elasticsearch.search.internal.InternalScrollSearchRequest) @bci=49, line=233 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.search.internal.InternalScrollSearchRequest, org.elasticsearch.transport.TransportChannel) @bci=8, line=791 (Interpreted frame)
 - org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(org.elasticsearch.transport.TransportRequest, org.elasticsearch.transport.TransportChannel) @bci=6, line=780 (Interpreted frame)
 - org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run() @bci=12, line=270 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)
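
To compare affected and unaffected nodes, dumps like the one above can be summarized with a short script. The following is only a minimal sketch: it assumes each thread starts with a `Thread <id>: (state = ...)` header and that the first frame of each thread begins with `- `, as in the dump above. The function name `summarize_dump` is hypothetical and not part of any tool.

```python
import re
from collections import Counter, defaultdict

# Thread header as it appears in the dump, e.g.:
#   Thread 23022: (state = IN_JAVA)
THREAD_HEADER = re.compile(r"^Thread (\d+): \(state = (\w+)\)")
# A frame line: optional whitespace, "-", then the method name up to "(".
FRAME = re.compile(r"^\s*-\s+(\S+?)\(")


def summarize_dump(text):
    """Count threads per state and the topmost frame of each thread.

    Returns (states, top_frames), where states maps a state name to a
    thread count and top_frames maps a state name to a Counter of the
    topmost method seen for threads in that state.
    """
    states = Counter()
    top_frames = defaultdict(Counter)
    current_state = None
    need_top_frame = False
    for line in text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current_state = header.group(2)
            states[current_state] += 1
            need_top_frame = True  # the next frame line is the top frame
            continue
        frame = FRAME.match(line)
        if frame and need_top_frame:
            top_frames[current_state][frame.group(1)] += 1
            need_top_frame = False
    return states, top_frames
```

Running this on dumps from a loaded and a non-loaded node and comparing `states["IN_JAVA"]` and `top_frames["IN_JAVA"]` would show whether the IN_JAVA threads always sit in the same method.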
Does anybody know what we are experiencing, or have any tips on how to
further debug this?
/MaF