CircuitBreakingException: [parent] Data too large, data for [<transport_request>]

Hi,

I'm working on a 6.2.4 ES cluster with 8 VMs (2 physical servers).
Each VM has 28GB of memory, of which 15GB are dedicated to ES.

My cluster contains 80 billion documents (store size = 16.11TB / fielddata size = 21.4GB), split across 600 daily indices.
Each index has 2 shards and 1 replica.

Every day, my system archives an old index by moving it from an SSD storage instance to an HDD storage instance.
Four VMs store their data on SSD; the other four store theirs on HDD and hold the older indices.

Today I encountered an error:

[2018-08-08T12:04:35,814][WARN ][o.e.i.c.IndicesClusterStateService] [vm-8] [[index-2018-02-08][1]] marking and sending shard failed due to [failed recovery]
    org.elasticsearch.indices.recovery.RecoveryFailedException: [index-2018-02-08][1]: Recovery failed from {vm-6}{wPINhZZaTZmGwanoUW5rmA}{vV2Ob95NR9qtZspT_d2BsQ}{IP}{IP:PORT}{physical_server=physical-s2, storage=SSD} into {vm-8}{Bbz4NRC4SQ2h5GyzK5NQYw}{0GuO0WsRR0eSlu_TxE-A_w}{IP}{IP:PORT}{storage=HDD, physical_server=physical-s2}
            at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:288) [elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$900(PeerRecoveryTargetService.java:81) [elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:635) [elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
    ...
    Caused by: org.elasticsearch.transport.RemoteTransportException: [vm-6][IP:PORT][internal:index/shard/recovery/start_recovery]
    Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] phase1 failed
            at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:175) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.4.jar:6.2.4]
    ...
    Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: Failed to transfer [104] files with total size of [8.8gb]
            at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:419) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:173) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:98) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$000(PeerRecoverySourceService.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:107) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:104) ~[elasticsearch-6.2.4.jar:6.2.4]
    ...
    Caused by: org.elasticsearch.transport.RemoteTransportException: [vm-8][IP:PORT][internal:index/shard/recovery/file_chunk]
    Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [11260978100/10.4gb], which is larger than the limit of [11243782144/10.4gb]
            at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:230) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1502) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1382) ~[elasticsearch-6.2.4.jar:6.2.4]
            at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:64) ~[?:?]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
      ...

It seems that ES tries to send 104 files (8.8gb) to the HDD instance, but the transfer would push the node's memory usage past the parent circuit breaker limit (10.4gb).
Do you know if I can limit the number of files sent in a batch, or if I can increase the limit?
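In the meantime, the only relevant knob I found is the dynamic `indices.recovery.max_bytes_per_sec` setting (default 40mb in 6.x), which throttles how fast recovery data is pushed to the target node. A sketch of lowering it, assuming the cluster listens on localhost:9200 (the 20mb value is just an example, not a recommendation):

```shell
# Throttle peer recovery traffic cluster-wide (dynamic setting, no restart needed)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}'
```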

Has anyone encountered this problem?

Thanks for your help,
Regards,
Guillaume.

It looks like you are running into heap limits as you are transferring data between nodes. What does the cluster nodes stats API give regarding the heap usage on the various nodes? What does the cluster stats API give for the cluster as a whole?
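For reference, something like the following should show the relevant numbers (assuming default host/port):

```shell
# Per-node circuit breaker and JVM heap stats
curl -s "localhost:9200/_nodes/stats/breaker,jvm?pretty"

# Cluster-wide summary (heap, store size, shard counts)
curl -s "localhost:9200/_cluster/stats?pretty"
```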

The nodes stats API gives me some interesting information (example from one random VM):

        "breakers": {
            "request": {
                "limit_size_in_bytes": 9637527552,
                "limit_size": "8.9gb",
                "estimated_size_in_bytes": 0,
                "estimated_size": "0b",
                "overhead": 1.0,
                "tripped": 0
            },
            "fielddata": {
                "limit_size_in_bytes": 9637527552,
                "limit_size": "8.9gb",
                "estimated_size_in_bytes": 3207562120,
                "estimated_size": "2.9gb",
                "overhead": 1.03,
                "tripped": 0
            },
            "in_flight_requests": {
                "limit_size_in_bytes": 16062545920,
                "limit_size": "14.9gb",
                "estimated_size_in_bytes": 2165,
                "estimated_size": "2.1kb",
                "overhead": 1.0,
                "tripped": 0
            },
            "accounting": {
                "limit_size_in_bytes": 16062545920,
                "limit_size": "14.9gb",
                "estimated_size_in_bytes": 7939158887,
                "estimated_size": "7.3gb",
                "overhead": 1.0,
                "tripped": 0
            },
            "parent": {
                "limit_size_in_bytes": 11243782144,
                "limit_size": "10.4gb",
                "estimated_size_in_bytes": 11146723172,
                "estimated_size": "10.3gb",
                "overhead": 1.0,
                "tripped": 13891
            }
        },
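Doing the arithmetic, these limits line up exactly with the documented 6.x defaults as fractions of my 15GB heap (heap_max_in_bytes = 16062545920). A quick sketch to check:

```shell
# Verify the breaker limits against the default heap fractions (ES 6.x defaults):
#   parent 70%, request 60%, fielddata 60%, in-flight 100%, accounting 100%
HEAP_MAX=16062545920   # heap_max_in_bytes from the node stats above

echo "parent    (70%):  $(( HEAP_MAX * 70 / 100 ))"   # 11243782144 -> matches the parent limit
echo "request   (60%):  $(( HEAP_MAX * 60 / 100 ))"   # 9637527552  -> matches request/fielddata
echo "in-flight (100%): $HEAP_MAX"                    # 16062545920 -> matches in_flight/accounting
```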

The heap seems fine to me:

            "mem": {
                "heap_used_in_bytes": 12186485616,
                "heap_used_percent": 75,
                "heap_committed_in_bytes": 16062545920,
                "heap_max_in_bytes": 16062545920,
                "non_heap_used_in_bytes": 157546952,
                "non_heap_committed_in_bytes": 165404672,
                "pools": {
                    "young": {
                        "used_in_bytes": 7246704,
                        "max_in_bytes": 349044736,
                        "peak_used_in_bytes": 349044736,
                        "peak_max_in_bytes": 349044736
                    },
                    "survivor": {
                        "used_in_bytes": 2848904,
                        "max_in_bytes": 43581440,
                        "peak_used_in_bytes": 43581440,
                        "peak_max_in_bytes": 43581440
                    },
                    "old": {
                        "used_in_bytes": 12176390008,
                        "max_in_bytes": 15669919744,
                        "peak_used_in_bytes": 12176907240,
                        "peak_max_in_bytes": 15669919744
                    }
                }
            },

The cluster stats API gives me the following data:

Heap:

          "mem" : {
            "heap_used" : "79.5gb",
            "heap_used_in_bytes" : 85380942184,
            "heap_max" : "119.4gb",
            "heap_max_in_bytes" : 128273481728
          },

OS:

          "mem" : {
            "total" : "224gb",
            "total_in_bytes" : 240518168576,
            "free" : "26.2gb",
            "free_in_bytes" : 28233650176,
            "used" : "197.7gb",
            "used_in_bytes" : 212284518400,
            "free_percent" : 12,
            "used_percent" : 88
          }

Memory seems to be the problem.
The circuit breaker stats show that the accounting breaker (7.3gb) plus fielddata (2.9gb) nearly fill the parent circuit breaker limit (10.4gb) on their own, so any sizeable transport request trips it.
Is it safe to increase the parent circuit breaker limit?
Can I limit the data taken into account by the accounting circuit breaker?

Thanks!

How much data do you have on the nodes? How many indices and shards?

I hope this information answers your question:

I'm working on a 6.2.4 ES cluster with 8 VMs (2 physical servers).
Each VM has 28GB of memory, of which 15GB are dedicated to ES.

My cluster contains 80 billion documents (store size = 16.11TB / fielddata size = 21.4GB), split across 600 daily indices.
Each index has 2 shards and 1 replica.

I saw that in ES 7.0, the default parent circuit breaker limit changes from 70% to 95% of the heap.

Is it safe in 6.2 to raise it from 70% to 95%? Could that solve my problem?

Thanks!

You may be able to increase it, but going that far may result in instability and nodes going OOM.
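If you do decide to raise it, `indices.breaker.total.limit` is a dynamic cluster setting, so a more modest bump can be tried without a restart and rolled back just as easily. A sketch (the 80% value is an arbitrary example, not a recommendation):

```shell
# Raise the parent circuit breaker limit cluster-wide; remove the
# transient setting (set it to null) to return to the default.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.breaker.total.limit": "80%"
  }
}'
```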

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.