Elasticsearch Heap size growing with time and lot of GC, eventually pulling the cluster down

ramindar · November 8, 2015, 12:23am

I have a ES cluster with 16 nodes in Azure

3 Master (A2 machine - 3.5GB mem)
3 Search (A2 machine - 3.5GB mem)
10 Data Nodes (A3 machine - 7 GB mem)

In total 6 indexes

Index Doc Size Shards Replicas Status
index-1 64,068 249.6MB 2 2 open
index-2 93 2.6MB 2 2 open
index-3 24,502 24.4MB 2 2 open
index-4 12,697 4.3MB 2 2 open
index-5 17,556 7.0MB 2 2 open
index-6 347,483 92.8MB 2 3 open

Elastic Search Version - 1.7.2
OS: Windows Server 2012 R2 (6.3)
Java Version: 1.8.0_60
JVM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23, mixed mode)
Java Vendor: Oracle Corporation

elasticsearch.yml

bootstrap.mlockall: true

indices.breaker.total.limit: 70%
indices.fielddata.cache.size: 70%

Logs:
[2015-11-07 20:30:27,688][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205968][5993] duration [22.4s], collections [1]/[23.1s], total [22.4s]/[44.8m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [252.1mb]->[2.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:02,485][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205969][5994] duration [33.7s], collections [1]/[34.7s], total [33.7s]/[45.3m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [2.2kb]->[116.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:26,313][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205970][5995] duration [22.6s], collections [1]/[23.8s], total [22.6s]/[45.7m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [116.2kb]->[6.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:01,125][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205972][5996] duration [33.3s], collections [1]/[33.7s], total [33.3s]/[46.3m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[5.5mb]/[266.2mb]}{[survivor] [8.5mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:24,718][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205974][5997] duration [22.2s], collections [1]/[22.5s], total [22.2s]/[46.6m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[9.4mb]/[266.2mb]}{[survivor] [15.3mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:58,937][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205976][5998] duration [32.5s], collections [1]/[33.2s], total [32.5s]/[47.2m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [254.3mb]->[17.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}

GC kicks in every minute and takes around have a minute, eventually the queries queue up and we start getting the exceptions in the Search Nodes

org.elasticsearch.transport.RemoteTransportException: [ES-PROD-12-data][inet[/10.1.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@5b5e7b85
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)

Srinath_C · November 8, 2015, 1:15am

Hi Raminder,

I think indices.fielddata.cache.size: 70% is too high. Try reducing that to
a very low value and instead of that set doc_values on all you attributes.
This should significantly reduce the load on the heap. The disk utilization
will increase though.

Regards,
Srinath

ramindar · November 10, 2015, 6:53pm

Thanks Srinath.
I reduced the indices.fielddata.cache.size: to 10% now, and the heap seems to happy for few days. Post that everyday the Heap size seems to be growing again. even thought the data remains the same. Not sure if i regularly run optimize over the indexes, would that help ?

Srinath_C · November 11, 2015, 12:12am

Reducing indices.fielddata.cache.size: to 10% was not the fix. Setting
doc_values on properties in the index mappings was the solution. With that
there will be a reduced load on memory. Did you set doc_values and try?

ramindar · November 12, 2015, 7:41am

Hi Srinath,

I tried with the doc_values as well. but the problem still persists. After few hours the old gen memory grows to its limit and GC kicks in and same issue starts again. also there is another error

org.elasticsearch.transport.RemoteTransportException: [ES-PROD-9-data][inet[/10.1.0.18:9300]][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1068983]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:507)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:868)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:862)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I took the heap dump and observed that most of the memory, 25% is used by ConcurrentHahMap and 25% is used by string. Not sure what is happening

ramindar · November 20, 2015, 4:13am

I thought memory is the problem, so i tried to increase the VM size to A6 having 28gb memory, but still i got the same issue.

Search Thread count goes to 1000 and then huge set of exceptions

org.elasticsearch.transport.RemoteTransportException:
[ES-PROD-12-data][inet[/10.1.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on

any help would be greatly appreciated.

Topic		Replies	Views
Long running GC, cluster status RED, only few GB's data Elasticsearch	12	2519	July 5, 2017
Long GC on Elasticsearch master nodes Elasticsearch	4	1654	March 16, 2017
ElasticSearch gc performance on cluster Elasticsearch	3	650	July 5, 2017
Es instance unusable Elasticsearch	7	737	October 12, 2020
ES used heap % grows slowly until system becomes unresponsive Elasticsearch	21	5523	July 5, 2017

Elasticsearch Heap size growing with time and lot of GC, eventually pulling the cluster down

Related topics