Crash after a few days

Hello everybody,
I have been running the ELK stack on a single server for testing purposes for the last week.
Everything has worked fine, but I have seen more and more WARN rows in the Elasticsearch logs.

And today this happened:
[2017-09-12T15:06:48,256][WARN ][o.e.m.j.JvmGcMonitorService] [ELKSERVER01] [gc][154] overhead, spent [1.4s] collecting in the last [1.4s]
[2017-09-12T15:06:52,532][WARN ][o.e.m.j.JvmGcMonitorService] [ELKSERVER01] [gc][155] overhead, spent [2.6s] collecting in the last [1.3s]
[2017-09-12T15:06:54,707][INFO ][o.e.m.j.JvmGcMonitorService] [ELKSERVER01] [gc][156] overhead, spent [1.5s] collecting in the last [4.6s]
[2017-09-12T15:06:55,795][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ELKSERVER01] fatal error in thread [elasticsearch[ELKSERVER01][generic][T#2]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.BytesRefHash.rehash(BytesRefHash.java:391) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:302) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1571) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1316) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:663) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:607) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:505) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.access$300(IndexShard.java:142) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard$IndexShardRecoveryPerformer.index(IndexShard.java:1841) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.TranslogRecoveryPerformer.performRecoveryOperation(TranslogRecoveryPerformer.java:165) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.TranslogRecoveryPerformer.recoveryFromSnapshot(TranslogRecoveryPerformer.java:86) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard$IndexShardRecoveryPerformer.recoveryFromSnapshot(IndexShard.java:1836) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:241) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:220) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:91) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1036) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:990) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.StoreRecovery$$Lambda$1500/91854944.run(Unknown Source) ~[?:?]
at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1239) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$2(IndexShard.java:1487) ~[elasticsearch-5.5.2.jar:5.5.2]
at org.elasticsearch.index.shard.IndexShard$$Lambda$1499/892406454.run(Unknown Source) ~[?:?]

I don't really know how to troubleshoot this, so I removed all content from ES and started all over.

I am running the ELK components on Windows Server 2016. Is that OK, or is it recommended to run on Linux? (I am more familiar with Windows.)

Is there anything I can configure or change? Any commands to run the next time it happens?
I want to be better prepared next time.

It seems you are running out of heap space. How much heap do you have configured for Elasticsearch? How much data are you indexing into Elasticsearch? How many indices and shards are you creating?
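For example, something like the following requests (run from Kibana's Dev Tools console, or with curl against port 9200) should show the configured heap per node and the overall data, index and shard counts; the exact heap columns used here are just a suggestion:

GET _cat/nodes?v&h=name,heap.current,heap.percent,heap.max
GET _cluster/stats?human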

Hi, thanks for the reply, Christian.

How do I know how much heap I have configured for ES?
This is my \ProgramData\Elastic\Elasticsearch\config\jvm.options file:
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
-Xmx579m
-Xms579m

How much data am I indexing? Same here, how do I know?
Index size? Events per hour/minute?

Are indices and indexes the same thing?
I have default values when it comes to shards, so five?

I don't really know how to troubleshoot and monitor the ELK stack.
Are there any built-in maintenance tasks, or do I have to take care of that myself?

I don't know if that gives you the information you need.

/Micke

That is a very small heap, so if you are using default shard settings and storing data for any amount of time, that could explain the heap issues. You can find the number of shards through the cat indices or cat shards APIs.
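For example (again from Kibana's Dev Tools console or with curl), something along these lines:

GET _cat/indices?v
GET _cat/shards?v

The first lists every index with its primary and replica counts, document count and store size; the second lists each individual shard and the node it lives on.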

Each shard has some overhead in terms of heap usage, so as your heap is very small, you need to make sure you do not have a lot of small shards that waste resources. Aim to have shards that are at least a few GB in size. Reduce the number of primary shards per index and consider switching to using e.g. weekly or monthly indices if you have a long retention period. You can naturally also scale up/out your cluster.
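As a rough sketch (assuming the data is written by Logstash into the default daily logstash-* indices), an index template along these lines would give newly created indices a single primary shard and no replica, which is usually plenty on a one-node test cluster; the template name and order value are made up for this example:

PUT _template/logstash_single_shard
{
  "template": "logstash-*",
  "order": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

This only affects indices created after the template is added. To get monthly indices you could also change the index option of the Logstash elasticsearch output to something like "logstash-%{+YYYY.MM}". If you keep the node as it is, the heap itself is set by the -Xms and -Xmx lines in the jvm.options file you posted, so you could raise those (keeping them equal) if the machine has memory to spare.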

I have created a blog post with some practical guidelines around shard size and count that may be useful.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.