I have a ES cluster with 16 nodes in Azure
3 Master (A2 machine - 3.5GB mem)
3 Search (A2 machine - 3.5GB mem)
10 Data Nodes (A3 machine - 7 GB mem)
In total 6 indexes
Index Doc Size Shards Replicas Status
index-1 64,068 249.6MB 2 2 open
index-2 93 2.6MB 2 2 open
index-3 24,502 24.4MB 2 2 open
index-4 12,697 4.3MB 2 2 open
index-5 17,556 7.0MB 2 2 open
index-6 347,483 92.8MB 2 3 open
Elastic Search Version - 1.7.2
OS: Windows Server 2012 R2 (6.3)
Java Version: 1.8.0_60
JVM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23, mixed mode)
Java Vendor: Oracle Corporation
elasticsearch.yml
bootstrap.mlockall: true
indices.breaker.total.limit: 70%
indices.fielddata.cache.size: 70%
Logs:
[2015-11-07 20:30:27,688][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205968][5993] duration [22.4s], collections [1]/[23.1s], total [22.4s]/[44.8m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [252.1mb]->[2.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:02,485][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205969][5994] duration [33.7s], collections [1]/[34.7s], total [33.7s]/[45.3m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [2.2kb]->[116.2kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:31:26,313][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205970][5995] duration [22.6s], collections [1]/[23.8s], total [22.6s]/[45.7m], memory [3.1gb]->[3.1gb]/[3.4gb], all_pools {[young] [116.2kb]->[6.2mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:01,125][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205972][5996] duration [33.3s], collections [1]/[33.7s], total [33.3s]/[46.3m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[5.5mb]/[266.2mb]}{[survivor] [8.5mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:24,718][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205974][5997] duration [22.2s], collections [1]/[22.5s], total [22.2s]/[46.6m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [266.2mb]->[9.4mb]/[266.2mb]}{[survivor] [15.3mb]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
[2015-11-07 20:32:58,937][WARN ][monitor.jvm ] [ES-PROD-12-data] [gc][old][205976][5998] duration [32.5s], collections [1]/[33.2s], total [32.5s]/[47.2m], memory [3.4gb]->[3.1gb]/[3.4gb], all_pools {[young] [254.3mb]->[17.4mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [3.1gb]->[3.1gb]/[3.1gb]}
GC kicks in every minute and takes around have a minute, eventually the queries queue up and we start getting the exceptions in the Search Nodes
org.elasticsearch.transport.RemoteTransportException: [ES-PROD-12-data][inet[/10.1.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@5b5e7b85
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:79)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:224)