[7.0.1] New Circuit Breaker - Segment bigger than heap. Should this break?

Dear developers,

I'm wondering how the new circuit breaker is supposed to work in relation to shard size.

I see this circuit breaker exception in my logs:

[2019-08-15T18:12:09,472][WARN ][o.e.i.c.IndicesClusterStateService] [mes-any-testssd-qa002-mes_any_testssd1] [[mfts-load][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [mfts-load][0]: Recovery failed from {mes-any-testssd-qa001-mes_any_testssd1}{TVCXPfduT76snBVeHaxYkA}{aZ9_GJcERZOv7IDbTvg6Cw}{10.90.26.21}{10.90.26.21:9300} into {mes-any-testssd-qa002-mes_any_testssd1}{PLAdWg1NQGq8V5TDWZkHQA}{8EtEQ6GFT-e4mqK-jKjPJA}{10.90.26.24}{10.90.26.24:9300}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) [elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) [elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) [elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1124) [elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$24(TcpTransport.java:1001) [elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.0.1.jar:7.0.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [mes-any-testssd-qa001-mes_any_testssd1][10.90.26.21:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [31628069438/29.4gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31627032296/29.4gb], new bytes reserved: [1037142/1012.8kb]
        at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:343) ~[elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1026) ~[elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) ~[elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753) ~[elasticsearch-7.0.1.jar:7.0.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) ~[?:?]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(Byte

I have 31GB of heap allocated to the Elasticsearch process.

If I understand it correctly, the breaker rejects the request because the data is bigger than the heap size. Does this mean that I can no longer have shards bigger than the size of my heap?

Thank you for your insight!

hey,

Elasticsearch stopped processing this request because the memory required for it was not available; other parts of the application had already taken up the JVM heap.
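
As a rough sanity check (assuming the parent breaker is at its default real-memory limit of 95% of the heap, i.e. indices.breaker.total.limit has not been changed), the numbers in your exception line up:

    # Back-of-the-envelope check of the figures from the log above
    heap_bytes = 31 * 1024**3                # 31GB heap = 33285996544 bytes
    limit = int(heap_bytes * 0.95)           # assumed default parent breaker limit (95% of heap)
    print(limit)                             # 31621696716 -> "limit of [31621696716/29.4gb]"

    real_usage = 31627032296                 # "real usage" from the exception
    new_bytes = 1037142                      # "new bytes reserved" for the transport request
    print(real_usage + new_bytes)            # 31628069438 -> "would be [31628069438/29.4gb]"
    print(real_usage + new_bytes > limit)    # True -> the breaker trips

In other words, the heap was already at roughly 95% before the ~1MB transport request arrived; the size of the request itself is not the problem.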

The main question here is what is taking up so much of your heap, as 31GB is quite a bit. Checking the nodes stats API might help you here.
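
For example, something along these lines (just a sketch, assuming a node is reachable on localhost:9200 without authentication) prints per-node heap usage and the state of the parent breaker:

    import requests

    # Fetch JVM and circuit breaker statistics for every node
    resp = requests.get("http://localhost:9200/_nodes/stats/jvm,breaker")
    for node in resp.json()["nodes"].values():
        parent = node["breakers"]["parent"]
        print(node["name"],
              "heap_used:", str(node["jvm"]["mem"]["heap_used_percent"]) + "%",
              "parent_breaker:", parent["estimated_size"], "of", parent["limit_size"],
              "tripped:", parent["tripped"])

The same information is available with a plain GET _nodes/stats/jvm,breaker via curl or the Kibana console.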

--Alex

Hey,

Thank you for replying!

The cluster doesn't appear to be under much memory pressure. It even garbage-collected down to ~30% heap usage shortly before the event.
I noticed that the segments of the index got merged shortly before the circuit breaker fired. At the time, that test index consisted of a single 55GB shard with 1 replica. Somewhere in the back of my head I have a vague recollection that segments used to be transferred in one big piece in earlier versions of the software, and that this behavior was later changed to split them into more manageable chunks (block-wise or something similar). Maybe I misunderstood something; at least I can't find anything in the docs that describes this.

Could this error stem from transferring one big segment that got merged to a size larger than 31GB?

Is there some knowledge base where I can find out how this replication mechanism works?

Thank you!

GNA



I can reproduce this. It merges the segments, then the circuit breakers fire.

My current theory:
Merging seems expensive in disk I/O and CPU time, so most likely it happens only on the primary shard, and the replica is then synced via TransportShardBulkAction. But the segment it receives is too big, and the circuit breaker steps in.

Is this expected behavior, or can I tune something to transfer the new big segment in smaller pieces?

Thank you!

Hi,

The segments are not transferred in one chunk; they are broken up, so the size of a segment should not be the issue here. What happened is that the recovery temporarily used too much heap.
Do you use a non-standard setting for the maximum recovery throughput or the maximum number of in-flight chunks? If so, I'd suggest trying to reduce the value of that setting.
See https://www.elastic.co/guide/en/elasticsearch/reference/7.0/recovery.html for the settings I'm referring to.
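
If you did change them and want to dial them back, both are dynamic cluster settings. Roughly like this (only a sketch; the values are illustrative and it assumes Elasticsearch on localhost:9200):

    import requests

    # Defaults for the recovery settings (explicit overrides, if any,
    # appear under "persistent"/"transient" in the same response)
    settings = requests.get("http://localhost:9200/_cluster/settings"
                            "?include_defaults=true&flat_settings=true").json()
    print(settings["defaults"].get("indices.recovery.max_bytes_per_sec"))
    print(settings["defaults"].get("indices.recovery.max_concurrent_file_chunks"))

    # Reduce recovery throughput and the number of in-flight file chunks
    requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"transient": {
            "indices.recovery.max_bytes_per_sec": "20mb",          # default is 40mb
            "indices.recovery.max_concurrent_file_chunks": 1,
        }},
    )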

Hi @gna,

are you using the default garbage collection settings, or did you maybe change them to use ParallelOldGC? The choice of GC interacts with the real memory circuit breaker.
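
(If in doubt, the nodes info API shows what each node actually runs with; a quick sketch, assuming the node is reachable on localhost:9200:)

    import requests

    # Show the collector and the -XX flags each node was started with
    info = requests.get("http://localhost:9200/_nodes/jvm").json()
    for node in info["nodes"].values():
        print(node["name"], node["jvm"]["gc_collectors"])
        print("  flags:", [a for a in node["jvm"]["input_arguments"] if a.startswith("-XX")])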

Hi @Armin_Braun,
Thank you for replying. I did not change any of these settings. Thank you for clarifying that the segments are broken into smaller blocks. Seems like I was jumping to conclusions there.

Hi @HenningAndersen,
Thank you for replying. Currently G1GC is in use; these are all the GC settings currently configured:

10-:-XX:+UseG1GC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:-UseConcMarkSweepGC
10-:-XX:InitiatingHeapOccupancyPercent=75
-XX:CMSInitiatingOccupancyFraction=75

I'll try to dig deeper into the heap metrics on the next test run. If I find nothing, I'll try going back to CMS.

Thank you both for the suggestions, now I know what I can investigate further!

Hi @gna,

looking into G1 more, I think the problem is with InitiatingHeapOccupancyPercent (IHOP). This option works differently from CMSInitiatingOccupancyFraction, in that IHOP is calculated as the old space occupancy relative to the entire heap.

If you can experiment a little, I would suggest running without the InitiatingHeapOccupancyPercent option, which would fall back to the default (45). Also, depending on the workload, adding -XX:NewRatio=2 might be desirable too. In local experiments I needed both settings to avoid circuit breaking.
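
To make that concrete, the relevant jvm.options lines could then look roughly like this (just a sketch; NewRatio only if it turns out to be needed for your workload):

    10-:-XX:+UseG1GC
    10-:-XX:NewRatio=2
    # no InitiatingHeapOccupancyPercent line, so the G1 default (45) applies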

I would be very interested in the outcome of such experiments.

Hi @HenningAndersen,
Thank you! I will test that. Currently I use this:

    - '10-:-XX:G1ReservePercent=20'

which also seems to avoid circuit breaking. The test installation compacts down to ~25% heap usage when doing a "full" (concurrent) GC. So currently it looks like it ran into the circuit breaker before it was able to run a concurrent GC cycle. But I haven't done that many test runs yet to be completely sure.

Hi @gna,

I think using your option together with not specifying InitiatingHeapOccupancyPercent might be more desirable than tweaking the new generation. Thanks for the input here.
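
Concretely, that combination would look roughly like this (again just a sketch):

    10-:-XX:+UseG1GC
    10-:-XX:G1ReservePercent=20
    # InitiatingHeapOccupancyPercent left unset, so the G1 default (45) applies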

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.