Failed to clean async result

Hi,
I ran into this issue on the transport layer (transport_worker thread).

In my workflow, an ingest node (36 GB memory) receives data from the source, and that data should then be forwarded to the data nodes (16 GB).

Can we control the size of the bulk data sent between nodes, to avoid overloading memory in "transport_worker"?

Environment: Elasticsearch 8.1.0

from: destination

{"@timestamp":"2023-02-22T21:05:17.966Z", "log.level":"ERROR", "message":"failed to clean async result [Fk93UHNqek45VDltckwwZzFpTW9ROVEgdTRkejNENERSU0tWWXFJZzRlLVFiUToy=]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_data_ssd_3_1_ingest][transport_workerog.logger":"org.elasticsearch.xpack.core.async.DeleteAsyncResultsService","trace.id":"b5ae78782e50ff55a5e3fb93952aea31","elasticsearch.cluster.uuid":"XDEw48F5SEu3KcS3_jticsearch.node.id":"u4dz3D4DRSKVYqIg4e-QbQ","elasticsearch.node.name":"es_data_ssd_3_1_ingest","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearc.RemoteTransportException","error.message":"[es_data_ssd_2_3][10.0.9.224:9300][indices:data/write/bulk[s]]","error.stack_trace":"org.elasticsearch.transport.RemoteTranson: [es_data_ssd_2_3][10.0.9.224:9300][indices:data/write/bulk[s]]\nCaused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data s:data/write/bulk[s]] would be [8474736228/7.8gb], which is larger than the limit of [8418135900/7.8gb], real usage: [8474735912/7.8gb], new bytes reserved: [316/316b],del_inference=0/0b, inflight_requests=316/316b, request=557056/544kb, fielddata=1449832688/1.3gb, eql_sequence=0/0b]\n\tat org.elasticsearch.indices.breaker.HierarchyCirService.checkParentLimit(HierarchyCircuitBreakerService.java:440)\n\tat org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMtBreaker.java:108)\n\tat org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:215)\n\tat org.elasticsearch.transport.InboundAggregator.finion(InboundAggregator.java:119)\n\tat org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)\n\tat org.elasticsearch.transport.InboundPipdleBytes(InboundPipeline.java:121)\n\tat org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:86)\n\tat 
org.elasticsearch.transport.netty4.NettynnelHandler.channelRead(Netty4MessageChannelHandler.java:74)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:3o.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelctChannelHandlerContext.java:357)\n\tat io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)\n\tat io.netty.channel.AbstractChannelHandlerContexnnelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat ionel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessjava:103)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerCoeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat ndler.ssl.SslHandler.unwrap(SslHandler.java:1371)\n\tat io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)\n\tat io.netty.handler.ssl.SslHandler.andler.java:1283)\n\tat io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)\n\tat io.netty.handler.codec.ByteToMes.callDecode(ByteToMessageDecoder.java:449)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)\n\tat io.netty.channel.AbstractCerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContextn\tat 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.cDefaultChannelPipeline.java:1410)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.nnelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:9o.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jtat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586)etty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.neternal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat 

from: target gc logs

[2023-02-22T21:05:11.679+0000][7][gc             ] GC(749076) Pause Full (G1 Compaction Pause) 8168M->7467M(8192M) 910.329ms
[2023-02-22T21:05:11.680+0000][7][gc,cpu         ] GC(749076) User=21.93s Sys=2.28s Real=0.91s
[2023-02-22T21:05:11.680+0000][7][safepoint      ] Safepoint "G1CollectForAllocation", Time since last: 126898 ns, Reaching safepoint: 3584983 ns, At safepoint: 923835693 ns, Total: 927420676 ns
[2023-02-22T21:05:11.682+0000][7][gc,marking     ] GC(749071) Concurrent Mark From Roots 1203.205ms
[2023-02-22T21:05:11.682+0000][7][gc,marking     ] GC(749071) Concurrent Mark Abort
[2023-02-22T21:05:11.682+0000][7][gc             ] GC(749071) Concurrent Mark Cycle 1212.829ms
[2023-02-22T21:05:11.698+0000][7][gc,start       ] GC(749077) Pause Young (Normal) (G1 Preventive Collection)
[2023-02-22T21:05:11.698+0000][7][gc,task        ] GC(749077) Using 43 workers of 43 for evacuation
[2023-02-22T21:05:11.698+0000][7][gc,age         ] GC(749077) Desired survivor size 27262976 bytes, new threshold 15 (max threshold 15)
[2023-02-22T21:05:11.747+0000][7][gc,age         ] GC(749077) Age table with threshold 15 (max threshold 15)
[2023-02-22T21:05:11.747+0000][7][gc,age         ] GC(749077) - age   1:   54319424 bytes,   54319424 total
[2023-02-22T21:05:11.747+0000][7][gc             ] GC(749077) To-space exhausted
[2023-02-22T21:05:11.747+0000][7][gc,phases      ] GC(749077)   Pre Evacuate Collection Set: 1.2ms
[2023-02-22T21:05:11.747+0000][7][gc,phases      ] GC(749077)   Merge Heap Roots: 0.3ms
[2023-02-22T21:05:11.747+0000][7][gc,phases      ] GC(749077)   Evacuate Collection Set: 40.6ms
[2023-02-22T21:05:11.747+0000][7][gc,phases      ] GC(749077)   Post Evacuate Collection Set: 6.5ms
[2023-02-22T21:05:11.747+0000][7][gc,phases      ] GC(749077)   Other: 0.5ms
[2023-02-22T21:05:11.747+0000][7][gc,heap        ] GC(749077) Eden regions: 84->0(89)
[2023-02-22T21:05:11.747+0000][7][gc,heap        ] GC(749077) Survivor regions: 0->13(13)
[2023-02-22T21:05:11.747+0000][7][gc,heap        ] GC(749077) Old regions: 1802->1922
[2023-02-22T21:05:11.747+0000][7][gc,heap        ] GC(749077) Archive regions: 2->2
[2023-02-22T21:05:11.747+0000][7][gc,heap        ] GC(749077) Humongous regions: 105->105
[2023-02-22T21:05:11.747+0000][7][gc,metaspace   ] GC(749077) Metaspace: 132894K(134848K)->132894K(134848K) NonClass: 115859K(116992K)->115859K(116992K) Class: 17035K(17856K)->17035K(17856K)
[2023-02-22T21:05:11.747+0000][7][gc             ] GC(749077) Pause Young (Normal) (G1 Preventive Collection) 7803M->7999M(8192M) 49.110ms
[2023-02-22T21:05:11.747+0000][7][gc,cpu         ] GC(749077) User=0.30s Sys=0.04s Real=0.05s
[2023-02-22T21:05:11.751+0000][7][safepoint      ] Safepoint "G1CollectForAllocation", Time since last: 17182249 ns, Reaching safepoint: 815173 ns, At safepoint: 52579397 ns, Total: 53394570 ns

from: target logs
{"@timestamp":"2023-02-22T21:05:10.727Z", "log.level": "INFO", "message":"attempting to trigger G1GC due to high heap usage [8540251320]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_data_ssd_2_3][transport_worker][T#10]","log.logger":"org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService","elasticsearch.cluster.uuid":"XDEw48F5SEu3KcS3_jDNcw","elasticsearch.node.id":"YEzEpdGpT7iECEXqVTVhDQ","elasticsearch.node.name":"es_data_ssd_2_3","elasticsearch.cluster.name":"elk_cluster"}

Nope.

What is the output from the _cluster/stats?pretty&human API?
Why does your data node have less heap than your ingest node?

In this cluster, the ingest nodes sit behind a load balancer and receive the incoming data; their memory was increased to avoid "data too large" errors. I realize that I need to merge the data nodes to ensure each one has a larger heap. But my question was how we can control the size of the data exchanged on the cluster side (in other words, at the node interconnect / transport layer).

{
  "_nodes" : {
    "total" : 45,
    "successful" : 45,
    "failed" : 0
  },
  "cluster_name" : "elk_cluster",
  "cluster_uuid" : "XDEw48F5SEu3KcS3_jDNcw",
  "timestamp" : 1677232168236,
  "status" : "green",
  "indices" : {
    "count" : 4039,
    "shards" : {
      "total" : 13595,
      "primaries" : 10706,
      "replication" : 0.2698486829815057,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 20,
          "avg" : 3.3659321614260955
        },
        "primaries" : {
          "min" : 1,
          "max" : 20,
          "avg" : 2.650656102995791
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.6256499133448874
        }
      }
    },
    "docs" : {
      "count" : 87257267381,
      "deleted" : 13793949
    },
    "store" : {
      "size" : "35.4tb",
      "size_in_bytes" : 38936309373261,
      "total_data_set_size" : "35.4tb",
      "total_data_set_size_in_bytes" : 38936309373261,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "17.1gb",
      "memory_size_in_bytes" : 18464876752,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "7.5gb",
      "memory_size_in_bytes" : 8086785506,
      "total_count" : 476852189,
      "hit_count" : 13799137,
      "miss_count" : 463053052,
      "cache_size" : 235496,
      "cache_count" : 426545,
      "evictions" : 191049
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 147147,
      "memory" : "0b",
      "memory_in_bytes" : 0,
      "terms_memory" : "0b",
      "terms_memory_in_bytes" : 0,
      "stored_fields_memory" : "0b",
      "stored_fields_memory_in_bytes" : 0,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "0b",
      "norms_memory_in_bytes" : 0,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "0b",
      "doc_values_memory_in_bytes" : 0,
      "index_writer_memory" : "4gb",
      "index_writer_memory_in_bytes" : 4342523290,
      "version_map_memory" : "28.6mb",
      "version_map_memory_in_bytes" : 30039272,
      "fixed_bit_set" : "67.6mb",
      "fixed_bit_set_memory_in_bytes" : 70952208,
      "max_unsafe_auto_id_timestamp" : 1677230181092,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 117,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "binary",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 1387,
          "index_count" : 85,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 225,
          "index_count" : 41,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 33,
          "index_count" : 11,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 7814,
          "index_count" : 4000,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 396,
          "index_count" : 18,
          "script_count" : 0
        },
        {
          "name" : "double_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 391,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 737465,
          "index_count" : 966,
          "script_count" : 0
        },
        {
          "name" : "float_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 1241,
          "index_count" : 1050,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 2027,
          "index_count" : 997,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 419,
          "index_count" : 45,
          "script_count" : 0
        },
        {
          "name" : "integer_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 2886,
          "index_count" : 1107,
          "script_count" : 0
        },
        {
          "name" : "ip_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 89675,
          "index_count" : 4002,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 4674086,
          "index_count" : 2930,
          "script_count" : 0
        },
        {
          "name" : "long_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "match_only_text",
          "count" : 567,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 206,
          "index_count" : 27,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 1318527,
          "index_count" : 3209,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 13,
          "index_count" : 9,
          "script_count" : 0
        },
        {
          "name" : "shape",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 2502,
          "index_count" : 139,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 30085,
          "index_count" : 2683,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 4,
          "index_count" : 4,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 153,
          "index_count" : 9,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "8.1.0",
        "index_count" : 4039,
        "primary_shard_count" : 10706,
        "total_primary_size" : "20.5tb",
        "total_primary_bytes" : 22624821090317
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 45,
      "coordinating_only" : 0,
      "data" : 0,
      "data_cold" : 0,
      "data_content" : 15,
      "data_frozen" : 0,
      "data_hot" : 15,
      "data_warm" : 21,
      "ingest" : 3,
      "master" : 6,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 3
    },
    "versions" : [
      "8.1.0"
    ],
    "os" : {
      "available_processors" : 2880,
      "allocated_processors" : 2880,
      "names" : [
        {
          "name" : "Linux",
          "count" : 45
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.4 LTS",
          "count" : 45
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 45
        }
      ],
      "mem" : {
        "total" : "744gb",
        "total_in_bytes" : 798863917056,
        "adjusted_total" : "744gb",
        "adjusted_total_in_bytes" : 798863917056,
        "free" : "112.6gb",
        "free_in_bytes" : 120939143168,
        "used" : "631.3gb",
        "used_in_bytes" : 677924773888,
        "free_percent" : 15,
        "used_percent" : 85
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 145
      },
      "open_file_descriptors" : {
        "min" : 1463,
        "max" : 8342,
        "avg" : 4524
      }
    },
    "jvm" : {
      "max_uptime" : "17d",
      "max_uptime_in_millis" : 1473803846,
      "versions" : [
        {
          "version" : "17.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.2+8",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 45
        }
      ],
      "mem" : {
        "heap_used" : "237.3gb",
        "heap_used_in_bytes" : 254865265632,
        "heap_max" : "372gb",
        "heap_max_in_bytes" : 399431958528
      },
      "threads" : 12682
    },
    "fs" : {
      "total" : "142.7tb",
      "total_in_bytes" : 156918447169536,
      "free" : "91.1tb",
      "free_in_bytes" : 100248977702912,
      "available" : "91.1tb",
      "available_in_bytes" : 100248977702912
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 45
      },
      "http_types" : {
        "security4" : 45
      }
    },
    "discovery_types" : {
      "multi-node" : 45
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 45
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 10,
      "processor_stats" : {
        "append" : {
          "count" : 40387877,
          "failed" : 0,
          "current" : 0,
          "time" : "20.3s",
          "time_in_millis" : 20308
        },
        "conditional" : {
          "count" : 48929195,
          "failed" : 0,
          "current" : 0,
          "time" : "10.1m",
          "time_in_millis" : 607052
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "date" : {
          "count" : 40387877,
          "failed" : 0,
          "current" : 0,
          "time" : "6.2m",
          "time_in_millis" : 375407
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 197291501,
          "failed" : 59945293,
          "current" : 0,
          "time" : "4.5h",
          "time_in_millis" : 16231720
        },
        "json" : {
          "count" : 43064847,
          "failed" : 2676970,
          "current" : 0,
          "time" : "21.9m",
          "time_in_millis" : 1316633
        },
        "remove" : {
          "count" : 163498225,
          "failed" : 48929195,
          "current" : 0,
          "time" : "9.3m",
          "time_in_millis" : 562001
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set_security_user" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "uri_parts" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "user_agent" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    },
    "indexing_pressure" : {
      "memory" : {
        "current" : {
          "combined_coordinating_and_primary" : "0b",
          "combined_coordinating_and_primary_in_bytes" : 0,
          "coordinating" : "0b",
          "coordinating_in_bytes" : 0,
          "primary" : "0b",
          "primary_in_bytes" : 0,
          "replica" : "0b",
          "replica_in_bytes" : 0,
          "all" : "0b",
          "all_in_bytes" : 0
        },
        "total" : {
          "combined_coordinating_and_primary" : "0b",
          "combined_coordinating_and_primary_in_bytes" : 0,
          "coordinating" : "0b",
          "coordinating_in_bytes" : 0,
          "primary" : "0b",
          "primary_in_bytes" : 0,
          "replica" : "0b",
          "replica_in_bytes" : 0,
          "all" : "0b",
          "all_in_bytes" : 0,
          "coordinating_rejections" : 0,
          "primary_rejections" : 0,
          "replica_rejections" : 0
        },
        "limit" : "0b",
        "limit_in_bytes" : 0
      }
    }
  }
}

That is not something you need to control. The transport message size was tiny and not a problem:

new bytes reserved: [316/316b]

The issue is caused by everything else that uses memory on this node.
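
To make the numbers concrete, here is a minimal Python sketch (an illustration only, not Elasticsearch code) that pulls the byte counts out of the breaker message quoted in the first post and compares them. The function name `parse_breaker_message` and the variable names are my own; the 8192M heap size and the 7467M post-Full-GC figure are taken from the target node's GC log.

```python
import re

# Excerpt copied from the CircuitBreakingException in the ingest-node log above.
MESSAGE = (
    "would be [8474736228/7.8gb], which is larger than the limit of "
    "[8418135900/7.8gb], real usage: [8474735912/7.8gb], "
    "new bytes reserved: [316/316b]"
)

# Heap size reported by the target node's GC log: "(8192M)".
HEAP_BYTES = 8192 * 1024 * 1024


def parse_breaker_message(text):
    """Extract the byte counts from a parent circuit breaker message."""
    patterns = {
        "would_be": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "new_bytes": r"new bytes reserved: \[(\d+)/",
    }
    return {
        name: int(re.search(pattern, text).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_breaker_message(MESSAGE)

# How far the heap was over the limit before this request added anything.
already_over = stats["real_usage"] - stats["limit"]

# The parent limit expressed as a fraction of the heap.
limit_fraction = stats["limit"] / HEAP_BYTES

# Heap occupancy left after the Full GC line "8168M->7467M(8192M)".
after_full_gc = 7467 / 8192

print(f"request size:        {stats['new_bytes']} bytes")
print(f"over limit already:  {already_over} bytes")
print(f"limit / heap:        {limit_fraction:.2f}")
print(f"heap after Full GC:  {after_full_gc:.0%}")
```

Under these assumptions, the heap was already over the parent limit before the 316-byte message was counted, and even a Full GC left it more than 90% occupied, so shrinking the transport messages would not change the outcome.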
