java.lang.OutOfMemoryError: Java heap space, with a bad network

(陈闯) #1

Elasticsearch version

bin/elasticsearch --version
6.3.2

Plugins installed

bin/elasticsearch-plugin list
analysis-ik 

JVM version

java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

OS version

uname -a
Linux bj2-search-log-es05.uclcn 2.6.32-431.11.32.el6.ucloud.x86_64 #1 SMP Sun Jun 18 20:58:44 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem
All the master nodes (5 nodes, each of which is both a master node and a data node) of my ES cluster (20 nodes in total) hit **java.lang.OutOfMemoryError** and went down. The other 15 data nodes are still fine.
It went through several stages, which I will try to describe in detail, but the body of this post is limited to 7000 characters, so I have cut some of it.

1. The ES log shows GC activity

There are more GC log lines than the ones shown here.


[2019-05-16T05:40:42,236][ERROR][o.e.x.m.c.i.IndexStatsCollector] [ES-1] collector [index-stats] timed out when collecting data
[2019-05-16T05:57:07,440][WARN ][o.e.m.j.JvmGcMonitorService] [ES-1] [gc][17140971] overhead, spent [10s] collecting in the last [10s]
[2019-05-16T05:57:20,870][INFO ][o.e.m.j.JvmGcMonitorService] [ES-1] [gc][old][17140975][11] duration [9.7s], collections [1]/[10.4s], total [9.7s]/[1.8m], memory [21.3gb]->[20.3gb]/[31gb], all_pools {[young] [1gb]->[64mb]/[0b]}{[survivor] [32mb]->[32mb]/[0b]}{[old] [20.2gb]->[20.2gb]/[31gb]}
[2019-05-16T05:57:20,870][WARN ][o.e.m.j.JvmGcMonitorService] [ES-1] [gc][17140975] overhead, spent [9.9s] collecting in the last [10.4s]

.... 
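
For readers who want to quantify the GC pressure in lines like these, here is a minimal sketch (not part of the original post) that extracts the fraction of wall-clock time spent in GC from the `overhead` lines. The regex is an assumption based only on the log format shown above, and `gc_overhead_fraction` is a hypothetical helper name.

```python
import re

# Matches the "overhead" lines written by JvmGcMonitorService, in the
# format seen in the log excerpt above (an assumption, not an official spec).
OVERHEAD_RE = re.compile(
    r"\[gc\]\[(?P<seq>\d+)\] overhead, "
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

# Conversion factors from the time-unit suffix to seconds.
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_fraction(line):
    """Return time-in-GC divided by window length, or None if no match."""
    m = OVERHEAD_RE.search(line)
    if m is None:
        return None
    spent = float(m["spent"]) * UNIT_SECONDS[m["spent_unit"]]
    window = float(m["window"]) * UNIT_SECONDS[m["window_unit"]]
    return spent / window
```

A value close to 1.0, as in the lines above, means the node spent almost the whole interval collecting garbage and did very little useful work.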

2. A lot of exceptions appear

After stage 1, the log shows a lot of EsRejectedExecutionException entries. This continued for about 20 minutes.

[2019-05-16T05:57:20,918][ERROR][o.e.a.b.TransportBulkAction] [ES-1] failed to execute pipeline for a bulk request
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.PipelineExecutionService$1@6454f572 on EsThreadPoolExecutor[name = ES-1/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@56fd2b45[Running, pool size = 16, active threads = 16, queued tasks = 200, completed tasks = 2013328964]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.3.2.jar:6.3.2]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_91]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_91]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:93) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.ingest.PipelineExecutionService.executeBulkRequest(PipelineExecutionService.java:59) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:495) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:134) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:85) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.3.2.jar:6.3.2]
	at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:128) ~[?:?]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.3.2.jar:6.3.2]
	
....
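
The rejection message already carries the thread pool state, so a small sketch like the following (not part of the original post) can pull it out and flag a fully saturated pool. The regex is an assumption based only on the message format shown above, and `parse_rejection` is a hypothetical helper name.

```python
import re

# Extracts the pool statistics embedded in an EsRejectedExecutionException
# message, following the format seen in the log excerpt above.
POOL_RE = re.compile(
    r"EsThreadPoolExecutor\[name = (?P<name>[^,]+), "
    r"queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), "
    r"active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+), "
    r"completed tasks = (?P<completed>\d+)\]"
)


def parse_rejection(line):
    """Return pool stats as a dict, or None if the line does not match."""
    m = POOL_RE.search(line)
    if m is None:
        return None
    stats = {k: int(v) for k, v in m.groupdict().items() if k != "name"}
    stats["name"] = m["name"]
    # Saturated: every thread is busy and the queue is at capacity.
    stats["saturated"] = (
        stats["active"] >= stats["pool_size"]
        and stats["queued"] >= stats["capacity"]
    )
    return stats
```

For the line above this reports the `ES-1/write` pool with 16 of 16 threads active and 200 of 200 queue slots used, which is why new bulk requests were rejected.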

3. It hits another round of GC and finally dies

After stage 2 the node hit another round of GC and finally died.
Before this GC happened, the exceptions had continued for 20 minutes.
The log suggests that a merge request may have led to this GC and, finally, the crash.
The merge request is a scheduled task, but I do not think it is the main cause.

[2019-05-16T06:17:08,404][WARN ][o.e.m.j.JvmGcMonitorService] [ES-1] [gc][old][17141110][128] duration [1m], collections [6]/[1m], total [1m]/[20.2m], memory [21.1gb]->[21.1gb]/[31gb], all_pools {[young] [0b]->[32mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [21.1gb]->[21gb]/[31gb]}
....
[2019-05-16T06:19:06,273][ERROR][o.e.i.e.Engine           ] [ES-1] [kk-log_cbb_base_mq_mq-service-2019.05.15][4] merge failed
java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:174) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:57) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:166) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	
[2019-05-16T06:19:06,274][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ES-1] fatal error in thread [elasticsearch[ES-1][generic][T#171]], exiting
java.lang.OutOfMemoryError: Java heap space
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:174) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.lucene50.Lucene50PostingsWriter.newTermState(Lucene50PostingsWriter.java:57) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:166) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:864) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:343) ~[lucene-core-7.3.1.jar:7.3.1 ae0705edb59eaa567fe13ed3a222fdadc7153680 - caomanhdat - 2018-05-09 09:27:24]
	at org.apache.

....
....

4. After this master went down, the other nodes that could become master ran into a similar situation.

5. During stages 1-4 the network was not good, but I do not have detailed monitoring for it.

The TCP ListenOverflows counter looked like this:

(David Turner) #2

If a node fails with an OutOfMemoryError then it writes a heap dump by default, and the best way to investigate is to open this heap dump in a tool like MAT and look for surprising memory consumers.

It does sound like your cluster would benefit from dedicated master nodes.
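
In 6.x a dedicated master is a node with `node.master: true`, `node.data: false` and `node.ingest: false` in elasticsearch.yml. As a rough sketch for finding the nodes that currently mix both roles, the following parses the text output of `GET _cat/nodes?h=name,node.role,master`. The function name and the sample node names are hypothetical, and the parser assumes node names contain no spaces.

```python
def mixed_master_data_nodes(cat_nodes_text):
    """Return names of nodes that are master-eligible and also hold data.

    Expects the plain-text output of
    GET _cat/nodes?h=name,node.role,master
    where node.role is a string of role letters such as "mdi".
    """
    mixed = []
    for line in cat_nodes_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, roles = parts[0], parts[1]
        if "m" in roles and "d" in roles:
            mixed.append(name)
    return mixed


# Hypothetical sample output, for illustration only.
sample = """
ES-1 mdi *
ES-2 mdi -
ES-6 di  -
"""
print(mixed_master_data_nodes(sample))
```

Any node this returns is a candidate for moving either its master role or its data role elsewhere.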

(陈闯) #3

Thank you so much, David. I forgot to point the heap dump file at the data disk, so it went to the system disk, and that disk is not big enough to hold the full heap dump file.
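
A heap dump can be about as large as the heap itself (31gb in the logs above), and its location is controlled by the `-XX:HeapDumpPath` JVM flag. As a minimal sketch, assuming you just want a quick check before relying on a given directory, the hypothetical helper below compares the free space there with the heap size. The 10% margin is an arbitrary assumption.

```python
import shutil


def can_hold_heap_dump(dump_dir, heap_bytes, margin=1.1):
    """Return True if dump_dir has room for a dump of heap_bytes.

    margin is a safety factor (an arbitrary choice here) on top of
    the configured heap size.
    """
    free = shutil.disk_usage(dump_dir).free
    return free >= heap_bytes * margin
```

For example, `can_hold_heap_dump("/data", 31 * 1024**3)` checks a hypothetical `/data` mount against a 31 GiB heap.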

I will turn the master nodes into dedicated masters (no data) later. Thank you so much!