Hi,
I'm seeing quite a few CircuitBreakingExceptions in my cluster when indexing new data. I'm feeling a bit out of my depth here, so any advice is greatly appreciated!
The errors:
[2019-10-10T15:11:27,499][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [es-2] failed to execute [indices:monitor/stats] on node [pCnSnJnkQNeKTOpOSbENwQ]
org.elasticsearch.transport.RemoteTransportException: [es-1][172.xxx.xxx.xxx:9300][indices:monitor/stats[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [16722733252/15.5gb], which is larger than the limit of [16320875724/15.1gb], real usage: [16722703920/15.5gb], new bytes reserved: [29332/28.6kb], usages [request=0/0b, fielddata=320373647/305.5mb, in_flight_requests=389344108/371.3mb, accounting=33927918/32.3mb]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.4.0.jar:7.4.0]
...
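In case it helps with diagnosis, I can pull the per-node breaker usage and the breaker settings with the standard APIs (nothing custom on our side) and post the output if that would be useful:

GET _nodes/stats/breaker
GET _cluster/settings?include_defaults=true&flat_settings=true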
My setup is as follows:
- 4x identical AWS nodes, all master- and ingest-eligible
- Ubuntu Server 18.04
- 8 CPUs, 32 GB RAM on each node
- 300 GB storage on internal NVMe SSD
- 16 GB heap on each node
- ~580M documents across 141 indices
- 2 shards and 1 replica per index
- 115 GB total index storage
I'm also seeing quite a few messages in the logs relating to garbage collection, and these numbers seem far higher than I'm comfortable with:
[2019-10-10T14:43:20,369][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][71900] overhead, spent [433ms] collecting in the last [1s]
[2019-10-10T14:50:12,689][WARN ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][72312] overhead, spent [593ms] collecting in the last [1s]
With this as a possible culprit, I've tried switching to G1GC, as I've read it can improve performance as long as your JVM version supports it. However, this doesn't seem to have had any noticeable effect on performance.
The documents and mapping seem fairly standard, other than that they contain one large nested field which can sometimes hold up to ~1000 nested items - though most documents have significantly fewer. We were previously indexing a very similar mapping in a smaller v1.7 cluster without any issues.
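As a rough illustration only (field names made up, the real mapping is larger), the shape is something like:

PUT example-index
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "items": {
        "type": "nested",
        "properties": {
          "item_id": { "type": "keyword" },
          "status": { "type": "keyword" },
          "value": { "type": "float" }
        }
      }
    }
  }
}

where the nested "items" array is the one that can grow to ~1000 entries per document.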
We're using quite a few bulk scripted updates to change data on specific items within the nested array (the idea was to avoid the overhead of pushing the entire nested object with a standard update). So my current thinking is that something about the nested objects and the scripted updates is causing garbage collection to go mad, but as far as I can see the heap looks OK?!
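To show what I mean by the scripted updates (heavily simplified - the real script and field names differ), each bulk action looks roughly like:

POST _bulk
{ "update": { "_index": "example-index", "_id": "doc-1", "retry_on_conflict": 3 } }
{ "script": { "lang": "painless", "source": "for (item in ctx._source.items) { if (item.item_id == params.id) { item.status = params.status } }", "params": { "id": "item-42", "status": "done" } } }

So each update loops over the whole nested array in ctx._source to find and modify one entry, rather than us reindexing the full document from the client side.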
My JVM options:
# Heap size
-Xms16g
-Xmx16g
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# The lines below have been uncommented to enable G1GC on JDK 10 or later.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30
## DNS cache policy
-Des.networkaddress.cache.negative.ttl=10
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# explicitly set the stack size
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/media/storage/elasticsearch
# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
9-:-Djava.locale.providers=COMPAT