7.4.0 Circuit breaking exceptions

Hi,

I'm seeing quite a few Circuit Breaking Exceptions happening in my cluster when indexing new data. I'm feeling a bit out of my depth here, so any advice is greatly appreciated!!! :slight_smile:

The errors:

[2019-10-10T15:11:27,499][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [es-2] failed to execute [indices:monitor/stats] on node [pCnSnJnkQNeKTOpOSbENwQ]
org.elasticsearch.transport.RemoteTransportException: [es-1][172.xxx.xxx.xxx:9300][indices:monitor/stats[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [16722733252/15.5gb], which is larger than the limit of [16320875724/15.1gb], real usage: [16722703920/15.5gb], new bytes reserved: [29332/28.6kb], usages [request=0/0b, fielddata=320373647/305.5mb, in_flight_requests=389344108/371.3mb, accounting=33927918/32.3mb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:342) ~[elasticsearch-7.4.0.jar:7.4.0]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.4.0.jar:7.4.0]
        ...

My setup is as following:

  • 4x identical AWS nodes, all master and ingest elligible
  • Ubuntu server 18.04
  • 8 CPU, 32G RAM on each node
  • 300GB storage on internal NVME SSD
  • Heap size 16G on each node
  • 580m total documents over 141 indices
  • 2 shards and 1 replica per index
  • 115G total index storage

I'm also seeing quite a few messages in the logs relating to garbage collection, and these numbers seem far higher than I'm comfortable with:

[2019-10-10T14:43:20,369][INFO ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][71900] overhead, spent [433ms] collecting in the last [1s]
[2019-10-10T14:50:12,689][WARN ][o.e.m.j.JvmGcMonitorService] [es-2] [gc][72312] overhead, spent [593ms] collecting in the last [1s]

With this as a possible issue, I've tried switching to G1GC garbage collection as I've read that can improve performance as long as it's supported by your JVM version. However this does not seem to have had any affect in performance.

The documents and mapping seem fairly standard, other than than they contain one large nested field which can sometimes contain up to ~1000 nested items - though most have significantly less than this. We were previously indexing a very similar mapping in a smaller v1.7 cluster without any issues.

We're using quite a few bulk script updates to change data on specific items within the nested items (the idea was to get around the overhead of having to push in the entire nested object with a standard update). So my current thinking is that there is something with the nested objects and scripts which is causing the Garbage collection to go mad, but as far as I can see the heap looks OK?!

My JVM options:

# Heap size
-Xms16g
-Xmx16g

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30

## DNS cache policy
-Des.networkaddress.cache.negative.ttl=10

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dio.netty.allocator.numDirectArenas=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/media/storage/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
9-:-Djava.locale.providers=COMPAT

Hi @MattJones,

I have a few initial questions that I hope can help us get closer to the problem cause:

  1. Which java version are you using? The bundled java 13 version I would assume?
  2. Does the problem occur in periods, i.e., you get a period of having circuit breaker exceptions? If so, can you estimate the duration of this from the log file?
  3. I notice you have GC logging enabled. To get the JVMs perspective on the heap usage, it could be good to have a peek at the GC log for the time period where this occurs.

Hi @HenningAndersen,

  1. Yes we're running the bundled Java 13
  2. Yes it does seem to happen in periods but it it's hard to estimate how long this happens for. We have quite a heavy indexing workload and it definitely seems to happen when the cluster is trying to index/update a large amount of documents at one time. From visually watching the cluster what I can say is that a node will disappear for around 30s before rejoining. I'll dig a litter deeper into the logs and see if I can get a more exact number for you.
  3. Yes here's the logs from the last time GC overhead was mentioned:

ES Server log:

[2019-10-14T13:02:04,566][INFO ][o.e.m.j.JvmGcMonitorService] [es-1] [gc][566688] overhead, spent [409ms] collecting in the last [1s]

GC Log for those few seconds:

[2019-10-14T13:02:00.829+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000097594 ns, Reaching safepoint: 161326 ns, At safepoint: 6418 ns, Total: 167744 ns
[2019-10-14T13:02:00.893+0000][1324][safepoint    ] Safepoint "RevokeBias", Time since last: 63230791 ns, Reaching safepoint: 152527 ns, At safepoint: 28832 ns, Total: 181359 ns
[2019-10-14T13:02:01.893+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000093720 ns, Reaching safepoint: 146121 ns, At safepoint: 11758 ns, Total: 157879 ns
[2019-10-14T13:02:02.893+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000092149 ns, Reaching safepoint: 144610 ns, At safepoint: 10323 ns, Total: 154933 ns
[2019-10-14T13:02:03.893+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000093432 ns, Reaching safepoint: 146883 ns, At safepoint: 11757 ns, Total: 158640 ns
[2019-10-14T13:02:04.894+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000108003 ns, Reaching safepoint: 162810 ns, At safepoint: 18143 ns, Total: 180953 ns
[2019-10-14T13:02:05.894+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000098929 ns, Reaching safepoint: 195874 ns, At safepoint: 4647 ns, Total: 200521 ns
[2019-10-14T13:02:06.894+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000094270 ns, Reaching safepoint: 158138 ns, At safepoint: 10276 ns, Total: 168414 ns
[2019-10-14T13:02:07.894+0000][1324][safepoint    ] Safepoint "Cleanup", Time since last: 1000116874 ns, Reaching safepoint: 194562 ns, At safepoint: 24835 ns, Total: 219397 ns
[2019-10-14T13:02:08.018+0000][1324][gc,start     ] GC(187626) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[2019-10-14T13:02:08.018+0000][1324][gc,task      ] GC(187626) Using 8 workers of 8 for evacuation
[2019-10-14T13:02:08.018+0000][1324][gc,age       ] GC(187626) Desired survivor size 419430400 bytes, new threshold 15 (max threshold 15)
[2019-10-14T13:02:08.041+0000][1324][gc,age       ] GC(187626) Age table with threshold 15 (max threshold 15)
[2019-10-14T13:02:08.041+0000][1324][gc,age       ] GC(187626) - age   1:   92586352 bytes,   92586352 total
[2019-10-14T13:02:08.041+0000][1324][gc,age       ] GC(187626) - age   2:   49861920 bytes,  142448272 total
[2019-10-14T13:02:08.041+0000][1324][gc,age       ] GC(187626) - age   3:   20708848 bytes,  163157120 total
[2019-10-14T13:02:08.041+0000][1324][gc,phases    ] GC(187626)   Pre Evacuate Collection Set: 1.2ms
[2019-10-14T13:02:08.041+0000][1324][gc,phases    ] GC(187626)   Evacuate Collection Set: 19.5ms
[2019-10-14T13:02:08.041+0000][1324][gc,phases    ] GC(187626)   Post Evacuate Collection Set: 1.6ms
[2019-10-14T13:02:08.041+0000][1324][gc,phases    ] GC(187626)   Other: 0.7ms
[2019-10-14T13:02:08.041+0000][1324][gc,heap      ] GC(187626) Eden regions: 739->0(822)
[2019-10-14T13:02:08.041+0000][1324][gc,heap      ] GC(187626) Survivor regions: 22->22(100)
[2019-10-14T13:02:08.041+0000][1324][gc,heap      ] GC(187626) Old regions: 291->291
[2019-10-14T13:02:08.041+0000][1324][gc,heap      ] GC(187626) Archive regions: 2->2
[2019-10-14T13:02:08.041+0000][1324][gc,heap      ] GC(187626) Humongous regions: 107->107
[2019-10-14T13:02:08.041+0000][1324][gc,metaspace ] GC(187626) Metaspace: 118411K->118411K(1159168K)
[2019-10-14T13:02:08.041+0000][1324][gc           ] GC(187626) Pause Young (Concurrent Start) (G1 Humongous Allocation) 9271M->3363M(16384M) 23.063ms
[2019-10-14T13:02:08.041+0000][1324][gc,cpu       ] GC(187626) User=0.09s Sys=0.00s Real=0.02s
[2019-10-14T13:02:08.041+0000][1324][gc           ] GC(187627) Concurrent Cycle
[2019-10-14T13:02:08.042+0000][1324][safepoint    ] Safepoint "G1CollectForAllocation", Time since last: 123684635 ns, Reaching safepoint: 154403 ns, At safepoint: 23187029 ns, Total: 23341432 ns
[2019-10-14T13:02:08.042+0000][1324][gc,marking   ] GC(187627) Concurrent Clear Claimed Marks
[2019-10-14T13:02:08.042+0000][1324][gc,marking   ] GC(187627) Concurrent Clear Claimed Marks 0.247ms
[2019-10-14T13:02:08.042+0000][1324][gc,marking   ] GC(187627) Concurrent Scan Root Regions
[2019-10-14T13:02:08.047+0000][1324][gc,marking   ] GC(187627) Concurrent Scan Root Regions 5.529ms
[2019-10-14T13:02:08.047+0000][1324][gc,marking   ] GC(187627) Concurrent Mark (570547.006s)
[2019-10-14T13:02:08.047+0000][1324][gc,marking   ] GC(187627) Concurrent Mark From Roots
[2019-10-14T13:02:08.047+0000][1324][gc,task      ] GC(187627) Using 2 workers of 2 for marking
[2019-10-14T13:02:08.530+0000][1324][gc,marking   ] GC(187627) Concurrent Mark From Roots 482.685ms
[2019-10-14T13:02:08.530+0000][1324][gc,marking   ] GC(187627) Concurrent Preclean
[2019-10-14T13:02:08.533+0000][1324][gc,marking   ] GC(187627) Concurrent Preclean 3.115ms
[2019-10-14T13:02:08.533+0000][1324][gc,marking   ] GC(187627) Concurrent Mark (570547.006s, 570547.491s) 485.867ms
[2019-10-14T13:02:08.533+0000][1324][gc,start     ] GC(187627) Pause Remark
[2019-10-14T13:02:08.540+0000][1324][gc           ] GC(187627) Pause Remark 3514M->3418M(16384M) 6.934ms
[2019-10-14T13:02:08.540+0000][1324][gc,cpu       ] GC(187627) User=0.04s Sys=0.00s Real=0.01s
[2019-10-14T13:02:08.540+0000][1324][safepoint    ] Safepoint "G1Concurrent", Time since last: 491726331 ns, Reaching safepoint: 189774 ns, At safepoint: 7055691 ns, Total: 7245465 ns
[2019-10-14T13:02:08.541+0000][1324][gc,marking   ] GC(187627) Concurrent Rebuild Remembered Sets
[2019-10-14T13:02:08.848+0000][1324][gc,marking   ] GC(187627) Concurrent Rebuild Remembered Sets 307.836ms
[2019-10-14T13:02:08.849+0000][1324][gc,start     ] GC(187627) Pause Cleanup
[2019-10-14T13:02:08.850+0000][1324][gc           ] GC(187627) Pause Cleanup 3439M->3439M(16384M) 0.868ms
[2019-10-14T13:02:08.850+0000][1324][gc,cpu       ] GC(187627) User=0.00s Sys=0.00s Real=0.01s
[2019-10-14T13:02:08.850+0000][1324][safepoint    ] Safepoint "G1Concurrent", Time since last: 307914784 ns, Reaching safepoint: 203340 ns, At safepoint: 1001140 ns, Total: 1204480 ns

These look like some quite big numbers for garbage collection! Especially considering the cluster seems to be doing it every few seconds.