Elasticsearch Heap size growing with time and lot of GC, eventually pulling the cluster down


(Ramindar Singh) #1

I have a ES cluster with 16 nodes in Azure

3 Master (A2 machine - 3.5GB mem)
3 Search (A2 machine - 3.5GB mem)
10 Data Nodes (A3 machine - 7 GB mem)

In total 6 indexes

Index Doc Size Shards Replicas Status
index-1 64,068 249.6MB 2 2 open
index-2 93 2.6MB 2 2 open
index-3 24,502 24.4MB 2 2 open
index-4 12,697 4.3MB 2 2 open
index-5 17,556 7.0MB 2 2 open
index-6 347,483 92.8MB 2 3 open

Elastic Search Version - 1.7.2
OS: Windows Server 2012 R2 (6.3)
Java Version: 1.8.0_60
JVM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23, mixed mode)
Java Vendor: Oracle Corporation

elasticsearch.yml

bootstrap.mlockall: true

indices.breaker.total.limit: 70%
indices.fielddata.cache.size: 70%

Logs:
[2015-11-07 20:30:27,688][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205968][5993] duration [22.4s], collections [1]/[23.1s], total [22.4s]/[44.8m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [252.1mb]->[2.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:02,485][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205969][5994] duration [33.7s], collections [1]/[34.7s], total [33.7s]/[45.3m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [2.2kb]->[116.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:26,313][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205970][5995] duration [22.6s], collections [1]/[23.8s], total [22.6s]/[45.7m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [116.2kb]->[6.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:01,125][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205972][5996] duration [33.3s], collections [1]/[33.7s], total [33.3s]/[46.3m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[5.5mb]/[266.2mb]}{[survivor] [8.5mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:24,718][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205974][5997] duration [22.2s], collections [1]/[22.5s], total [22.2s]/[46.6m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[9.4mb]/[266.2mb]}{[survivor] [15.3mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:58,937][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205976][5998] duration [32.5s], collections [1]/[33.2s], total [32.5s]/[47.2m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [254.3mb]->[17.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}

GC kicks in every minute and takes around have a minute, eventually the queries queue up and we start getting the exceptions in the Search Nodes

org.elasticsearch.transport.RemoteTransportException: [ES-PROD-12-data][inet[/10.1.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@5b5e7b85
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)


(Srinath C) #2

Hi Raminder,

I think indices.fielddata.cache.size: 70% is too high. Try reducing that to
a very low value and instead of that set doc_values on all you attributes.
This should significantly reduce the load on the heap. The disk utilization
will increase though.

Regards,
Srinath


(Ramindar Singh) #3

Thanks Srinath.
I reduced the indices.fielddata.cache.size: to 10% now, and the heap seems to happy for few days. Post that everyday the Heap size seems to be growing again. even thought the data remains the same. Not sure if i regularly run optimize over the indexes, would that help ?


(Srinath C) #4

Reducing indices.fielddata.cache.size: to 10% was not the fix. Setting
doc_values on properties in the index mappings was the solution. With that
there will be a reduced load on memory. Did you set doc_values and try?


(Ramindar Singh) #5

Hi Srinath,

I tried with the doc_values as well. but the problem still persists. After few hours the old gen memory grows to its limit and GC kicks in and same issue starts again. also there is another error

org.elasticsearch.transport.RemoteTransportException: [ES-PROD-9-data][inet[/10.1.0.18:9300]][indices:data/read/search[phase/fetch/id]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1068983]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:507)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:868)
at org.elasticsearch.search.action.SearchServiceTransportAction$FetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:862)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I took the heap dump and observed that most of the memory, 25% is used by ConcurrentHahMap and 25% is used by string. Not sure what is happening :frowning:


(Ramindar Singh) #6

I thought memory is the problem, so i tried to increase the VM size to A6 having 28gb memory, but still i got the same issue.

Search Thread count goes to 1000 and then huge set of exceptions

org.elasticsearch.transport.RemoteTransportException:
[ES-PROD-12-data][inet[/10.1.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on

any help would be greatly appreciated.


(system) #7