Elasticsearch monitoring is returning circuit breaker errors

Cluster version 7.4.2
3 Master nodes
10 data nodes
Our cluster sends its monitoring data to itself, and recently we started seeing these errors.

Can someone explain what this error means and potentially how to fix it?

[2020-02-04T21:58:47,739][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [nodeh01] failed to execute on node [_UgajvwaR9WLOLKx-oivqw]
org.elasticsearch.transport.RemoteTransportException: [nodew02][172.16.30.216:9300][cluster:monitor/nodes/info[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [37492613018/34.9gb], which is larger than the limit of [36531394969/34gb], real usage: [37492606920/34.9gb], new bytes reserved: [6098/5.9kb], usages [request=0/0b, fielddata=9093989529/8.4gb, in_flight_requests=6098/5.9kb, accounting=10594194739/9.8gb]


Hi @kyles,

are you using G1 GC? If so, please double check that you have:

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

in your jvm.options.

As a side note, I notice that you are using a 34GB heap. This is not recommended; you will be able to hold more data and handle more processing if you lower it to something like 30GB, since that allows the JVM to use compressed oops. Please see: https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html for more details on this.
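If you want to confirm which collector a node is running and whether compressed oops are in effect, the node info API should report both (the filter_path below is only there to trim the response), so something along these lines ought to show it:

GET _nodes/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc_collectors,nodes.*.jvm.using_compressed_ordinary_object_pointers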

@HenningAndersen Thanks for the info. We haven't touched anything besides setting the heap in our jvm.options file. Let me know what you think we should change; I have put the relevant part of our jvm.options file below.
Some more info about our cluster:
We run a hot/warm cluster setup; we are currently only seeing these errors on our warm nodes.

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms36g
-Xmx36g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:G1ReservePercent=25
# 10-:-XX:InitiatingHeapOccupancyPercent=30

Hi @kyles,

I think it will be beneficial to lower the heap to 31GB. You should double check that the JVM uses compressed oops as described in the link from my previous post.
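Concretely, that means changing the heap lines in your jvm.options to:

-Xms31g
-Xmx31g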

With a heap of 36GB, all object references use 8 bytes instead of 4 bytes. That overhead is likely to add up to much more than the 5GB of extra space you gain by going from 31GB to 36GB of heap. It is also slightly slower, since more data needs to be fetched (and cached) from RAM into the CPU.

If the circuit breaking issue persists, it could be useful to see node stats and GC logs from when this happened. Also, is there significant activity going on against the warm tier?
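If it does happen again, the per-breaker numbers from node stats would also be interesting, for example:

GET _nodes/stats/breaker

which should show the limit, estimated size and trip count for each breaker on each node.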

We lowered the heap on the warm nodes to 31GB, and the nodes now report that compressed oops are in use.
A couple of days later we started getting circuit breaker errors again, and most of our warm nodes stopped responding. Everything seemed to go back to normal once we deleted some data.

There shouldn't be any significant query load against the warm nodes most of the time; they are only searched when needed.

Here is a link to a GC log file; it covers about a two-hour time frame, roughly one hour before and one hour after the errors.

Hi @kyles,

looking at the GC log, the heap usage appears to be "permanent" (as in not reclaimed by GC). Given that this clears up when you delete data, it makes sense to look at where your heap usage is going.

The _nodes/stats endpoint will likely reveal more about this, but the original circuit breaker exception did indicate substantial fielddata use. You can use:

GET _nodes/stats?fielddata_fields=*

to get information on the memory consumption of individual fields. Maybe you can avoid loading fielddata by following some of the guidance here?
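For a quicker per-node overview of the same information, the cat fielddata API should also work (the fields parameter limits which fields are listed):

GET _cat/fielddata?v&fields=*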

What is the full output of the cluster stats API?
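That is:

GET _cluster/stats?human

(the human flag just makes the byte counts easier to read).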