Should CircuitBreakingException cause the node to become failed?

Hello

I'm running a 7.5.2 cluster. Today, while snapshots were being made (although I'm not sure it was snapshotting that triggered the issue), one of the nodes logged CircuitBreakingException and was failed for several minutes (the Elasticsearch process never stopped running), which made the cluster go yellow and broke snapshotting.

The logs:

[2020-02-25T00:08:51,031][WARN ][o.e.c.r.a.AllocationService] [elastic-logs3-p-master-2] failing shard [failed shard, shard [index-name-2020-02-25][1], node[HccOLzpQS2CKfsTNkZ9s0Q], [R], s[STARTED], a[id=nHmzdFniRDW-_5RCZtnFlg], message [failed to perform indices:data/write/bulk[s] on replica [index-name-2020-02-25][1], node[HccOLzpQS2CKfsTNkZ9s0Q], [R], s[STARTED], a[id=nHmzdFniRDW-_5RCZtnFlg]], failure [RemoteTransportException[[elastic-logs3-p-hotdata-1][172.30.1.168:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [30613028660/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [31540/30.8kb], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=31540/30.8kb, accounting=542812105/517.6mb]]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [elastic-logs3-p-hotdata-1][172.30.1.168:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30613028660/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [31540/30.8kb], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=31540/30.8kb, accounting=542812105/517.6mb]
(...)
[2020-02-25T00:08:51,053][WARN ][o.e.g.G.InternalReplicaShardAllocator] [elastic-logs3-p-master-2] [index-name-2020-02-25][1]: failed to list shard for shard_store on node [HccOLzpQS2CKfsTNkZ9s0Q]
org.elasticsearch.action.FailedNodeException: Failed node [HccOLzpQS2CKfsTNkZ9s0Q]
(...)
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-logs3-p-hotdata-1][172.30.1.168:9300][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [30612997414/28.5gb], which is larger than the limit of [30601641984/28.5gb], real usage: [30612997120/28.5gb], new bytes reserved: [294/294b], usages [request=0/0b, fielddata=53395/52.1kb, in_flight_requests=294/294b, accounting=542812105/517.6mb]

The circuit breakers' values:

"breakers" : {
  "request" : {
    "limit_size_in_bytes" : 19327352832,
    "limit_size" : "18gb",
    "estimated_size_in_bytes" : 0,
    "estimated_size" : "0b",
    "overhead" : 1.0,
    "tripped" : 0
  },
  "fielddata" : {
    "limit_size_in_bytes" : 12884901888,
    "limit_size" : "12gb",
    "estimated_size_in_bytes" : 34224,
    "estimated_size" : "33.4kb",
    "overhead" : 1.03,
    "tripped" : 0
  },
  "in_flight_requests" : {
    "limit_size_in_bytes" : 32212254720,
    "limit_size" : "30gb",
    "estimated_size_in_bytes" : 4139189,
    "estimated_size" : "3.9mb",
    "overhead" : 2.0,
    "tripped" : 0
  },
  "accounting" : {
    "limit_size_in_bytes" : 32212254720,
    "limit_size" : "30gb",
    "estimated_size_in_bytes" : 326626657,
    "estimated_size" : "311.4mb",
    "overhead" : 1.0,
    "tripped" : 0
  },
  "parent" : {
    "limit_size_in_bytes" : 30601641984,
    "limit_size" : "28.5gb",
    "estimated_size_in_bytes" : 20485356536,
    "estimated_size" : "19gb",
    "overhead" : 1.0,
    "tripped" : 2454
  }
},
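
(For reference, these values come from the breaker section of the node stats API; roughly the following, with the node name taken from the logs above and host/port being whatever your setup uses:)

curl -s 'localhost:9200/_nodes/elastic-logs3-p-hotdata-1/stats/breaker?pretty'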

The heap-usage spike is visible in the attached heap graph at around 01:09 browser time (00:09 VM time).

It is my understanding that CircuitBreakingException is meant to prevent heap-related failures. So if it was triggered to protect the node from failing, why did the node become unavailable?

The other question is what steps I should take to prevent the issue from occurring again. Increasing the heap would seem obvious, but not so much considering it's already set to 30GB, with the whole VM running on 64GB of RAM (it's an Elasticsearch-dedicated VM with no other load).

Thanks!

Do you have any custom circuit breaker settings?

You mean those parameters? https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
None that I would know of.
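
For what it's worth, one quick way to confirm nothing has been overridden is to list the cluster settings with defaults included and look for the breaker entries; a rough sketch (host/port are placeholders for my setup):

curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep breaker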

I now believe that in this case snapshotting was unrelated to the CircuitBreakingException. Today snapshotting failed as well; after investigation I found it was being capped by the throughput limit and eventually hit the timeout. No CircuitBreakingException this time.

The original questions regarding CircuitBreakingException still remain, though:

  • If it was triggered to protect the node from failing, why did the node become unavailable?
  • What steps should I take to prevent the issue from occurring again?

Hi @rsyne,

other users have reported similar circuit breaker exceptions when using G1 GC. I wonder if you are using G1? If so, you may want to double check that you have:

-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30

set in jvm.options (it is a good idea to double check that they show up in the "JVM arguments" line that Elasticsearch writes to its log file on startup).
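
One quick way to check that, assuming the default log location of the deb/rpm packages, could be something like:

grep 'JVM arguments' /var/log/elasticsearch/*.log | tr ',' '\n' | grep -E 'G1ReservePercent|InitiatingHeapOccupancyPercent'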

@HenningAndersen so far I have not moved to G1 GC. My GC-related JVM settings are as follows:

-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

With the whole JVM args being:

JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms30g, -Xmx30g, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-17301303172088226774, -XX:HeapDumpPath=/var/lib/elasticsearch, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -XX:MaxDirectMemorySize=16106127360, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=deb, -Des.bundled_jdk=true]

I can move to G1 GC if it would help my case, but would it?
Anyway, I'd be happy to have my original questions answered as well.

Thanks!

Hi @rsyne,

we have no indications that switching to G1 should help here. We do know of some issues on G1 (like getting those settings wrong), but for CMS, the defaults should be fine.

Regarding your questions:

If it was triggered to protect the node from failing, why did the node become unavailable?

From the log snippets here I cannot see that the node actually became unavailable, and I believe it did recover as well. The circuit breaker will reject requests while the circuit is broken, but it resumes accepting them once the issue has cleared up. This looks like it worked more or less as it should. The unfortunate part is that one of the requests that was circuit broken was a replication request, and by default this causes the shard to be failed, i.e., removed from this node and recovered onto another node.

What steps should I take to prevent the issue from occurring again?

It looks like there was significant load on this cluster at the time (purely based on the heap graphs), causing GC to not be able to keep up. First of all, if this is running in a container or something similar, you may want to double check that Elasticsearch and the JVM see the right number of processors. Second, purely from looking at this graph, your average permanent heap usage (i.e., the bottom of the spikes) is well below 50%. You may want to try tweaking -XX:CMSInitiatingOccupancyFraction, for instance setting -XX:CMSInitiatingOccupancyFraction=70.
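
For the processor check, the node info API reports what Elasticsearch and the JVM see; roughly the following, where host, port and node name are just examples from this thread (look at os.available_processors and os.allocated_processors in the output):

curl -s 'localhost:9200/_nodes/elastic-logs3-p-hotdata-1/os?pretty'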

You could also look into the actual activity at the time to see if there is anything that stands out. It might be a bad search that could be rewritten (or something else).

Adding more nodes could also help spread out the load.

@HenningAndersen that's a good insight. Thank you!

The node did indeed recover. And the proper wording is indeed "failed" (instead of "unavailable"). When the node becomes failed and re-sharding starts, the cluster obviously goes yellow, as well.

The cluster runs on tiered VMs - hotdata, colddata and masters. It's the hotdata nodes that become failed and these are running on 8 vCPUs, 64GB of RAM and 2TB of NVMe SSD.

After some additional investigation over the previous days I can say that the original diagnosis was more accurate than the one that followed, although it was not 100% correct. The story is that just after midnight, when load on the cluster is quite small, index snapshotting starts. Most indices are snapshotted without issues as these are relatively small. However, there are two indices that are large (~50GB on primaries, ~100GB on storage total - per index that is). And it happens (not daily, but quite often) that it is too much for the cluster and CircuitBreakingException occurs, which then causes re-sharding, which then causes snapshotting to fail.

As of now, neither decreasing the size of these indices nor adding extra nodes is likely to happen, so, other than trying to tweak InitiatingOccupancyFraction, I'm out of ideas. Do you have any suggestions?

How can I make it easier for the cluster to snapshot such a large index?

Hi @rsyne,

a snapshot repository has an option to throttle the snapshot process (max_snapshot_bytes_per_sec), see for instance: https://www.elastic.co/guide/en/elasticsearch/plugins/current/repository-s3-repository.html.

I wonder if this has been increased from the default? It might be worth lowering it, just to experiment and see if this helps.
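
A rough sketch of how to inspect the current repository settings and re-register with a lower throttle (the repository name, type and bucket below are placeholders and must match your existing repository):

# list registered repositories and their non-default settings
curl -s 'localhost:9200/_snapshot/_all?pretty'

# re-register the same repository with a lower throttle
curl -s -X PUT 'localhost:9200/_snapshot/my_repo' -H 'Content-Type: application/json' -d '
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}'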

To help diagnose further, it would be good to gain insight into the heap when this happens. Would it be possible to do a jmap -histo:file= while these snapshots are running? I am not entirely convinced that the larger shards are the issue; do you have additional info supporting that claim?
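
A rough sketch of what that could look like on a data node (the PID lookup and output path are assumptions; jmap should come from a JDK matching the running Elasticsearch JVM and be run as the same user as the Elasticsearch process):

# take a class histogram of the running Elasticsearch process and write it to a file
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
sudo -u elasticsearch jmap -histo:file=/tmp/es-histo-$(date +%s).txt "$ES_PID"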

max_snapshot_bytes_per_sec was decreased some time ago to 25 MBps, so I think decreasing it even more is not a very good idea.

It's simply that if the problem occurs, it always occurs when snapshotting the same index, which is the largest from the group that is being snapshotted daily. These indices are created per day, so do not contain exactly the same data (which excludes the possibility of some data corruption that could persist and cause issues), but their size is more or less similar each day:

1 1    2045332      0    2.5gb    1.2gb
1 1     567187      0    1.1gb  593.4mb
1 1    1274622      0    1.2gb  647.9m
3 1     331744      0  338.4mb  169.2mb
3 1   43276147      0   97.6gb   48.7gb <--- this is the one
3 1       9686      0   12.1mb      6mb
3 1   20818765      0   54.4gb   27.1gb
3 1       2144      0    3.1mb    1.4mb
3 1   28176126      0   24.1gb     12gb
3 1   90480338      0   91.6gb   45.7gb
3 1       1977      0      3mb    1.4mb
3 1    1271284      0    1.5gb  800.5mb
3 1   21794327      0   28.3gb   14.1gb
3 1        439      0  812.4kb  398.3kb
3 1        439      0  893.7kb  454.5kb
3 1        436      0  953.1kb  476.5kb
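
(For clarity, the columns above are roughly pri, rep, docs.count, docs.deleted, store.size and pri.store.size from the cat indices API; something along these lines, with the index pattern anonymized:)

curl -s 'localhost:9200/_cat/indices/index-name-*?h=pri,rep,docs.count,docs.deleted,store.size,pri.store.size'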

Sure, dumping the heap is something I might do if necessary.

Yeah, if it has already been lowered, I think there is no need to lower it more.

Let us try to do the jmap histogram when you have time for it. Notice that it does not produce a heap dump, only a histogram, which I imagine is less of a concern to share here (feel free to also PM it to me instead). We may need to move to a heap dump later, but let us start with this.

Sure, I think I can do that, although I'd need to automate it to run on all nodes in the cluster, as it's not always the same node that the exception occurs on. And at roughly the time when it does. And then we'd have to wait for the exception, as it does not occur daily.
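
Roughly what I have in mind per node, assuming jmap is available there (the log path, cluster-name-based file and the trigger condition are all guesses on my part):

# naive watcher: wait until the breaker trips on this node, then grab a class histogram
LOG=/var/log/elasticsearch/elastic-logs3-p.log
until grep -q CircuitBreakingException "$LOG"; do sleep 30; done
sudo -u elasticsearch jmap -histo:file=/tmp/es-histo-$(hostname).txt \
  "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"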

Anyway, is the jmap tool included in the ELK packages? So far Elasticsearch is running on its bundled JDK and I think it might be missing the tool:

# /usr/share/elasticsearch/jdk/bin/java --version
openjdk 13.0.1 2019-10-15
OpenJDK Runtime Environment AdoptOpenJDK (build 13.0.1+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 13.0.1+9, mixed mode, sharing)

@rsyne, which package did you install?

Did you try:

/usr/share/elasticsearch/jdk/bin/jmap

Oh, it's there. Sorry for not noticing.
Just for clarity - I used the generic repo https://artifacts.elastic.co/packages/7.x/apt and its elasticsearch package (dpkg -l reports it as: ii elasticsearch 7.5.2 amd64 Distributed RESTful search engine built for the cloud).

The issue has not occurred in the days since we last talked. Will monitor it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.