CircuitBreakingException: [parent] Data too large

Hi All,

This morning I got an error while checking Kibana Discover stating that it was unable to fetch results. I then checked Stack Monitoring and found the error below --

[screenshot: Stack Monitoring error]

On further investigation, I found that out of the 3 Elasticsearch nodes, one was down and another was heavily loaded. So I restarted the node that was down, but the issue persisted.

After that, I restarted the other two nodes, and then the issue was gone.

Checking the Elasticsearch cluster log further, I found several entries for the `CircuitBreakingException: [parent] Data too large` error, which had been occurring since yesterday.

Please see some of the snippets from the log below --

[2022-11-29T12:11:34,379][ERROR][o.e.x.s.TransportSubmitAsyncSearchAction] [node-1] failed to store async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8183624210/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8180977664/7.6gb], new bytes reserved: [2646546/2.5mb], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=2667158/2.5mb, model_inference=0/0b, accounting=10484420/9.9mb]
[2022-11-29T12:11:34,584][ERROR][o.e.x.c.a.AsyncResultsService] [node-1] failed to update expiration time for async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8239698386/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8239697920/7.6gb], new bytes reserved: [466/466b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=21544/21kb, model_inference=0/0b, accounting=10494204/10mb]
[2022-11-29T12:11:35,134][ERROR][o.e.x.c.a.DeleteAsyncResultsService] [node-1] failed to clean async result [FkhBVFVEMXZoVHR1MUFUMzdzMzJZVHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE0OTYxMw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/bulk[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [8357138758/7.7gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8357138432/7.7gb], new bytes reserved: [326/326b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=20938/20.4kb, model_inference=0/0b, accounting=10494204/10mb]
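As an aside, the `usages [...]` fragment in these messages breaks down the child breakers. The tracked figures above sum to roughly 1.6 GB, far below the 7.5 GB parent limit, which suggests the parent breaker is tripping on real heap usage rather than on tracked allocations. A small hypothetical helper to pull those figures out of a log line (the function name and sample string are mine, not an Elasticsearch API):

```python
import re

def parse_breaker_usages(message: str) -> dict:
    """Extract per-breaker byte counts from the 'usages [...]' fragment
    that Elasticsearch appends to [parent] Data too large errors."""
    match = re.search(r"usages \[([^\]]+)\]", message)
    if not match:
        return {}
    usages = {}
    for part in match.group(1).split(", "):
        name, value = part.split("=", 1)
        usages[name] = int(value.split("/")[0])  # bytes, before the human-readable form
    return usages

# Sample fragment taken from the first log entry above
sample = ("CircuitBreakingException: [parent] Data too large, ... usages "
          "[request=592060416/564.6mb, fielddata=1063089042/1013.8mb, "
          "in_flight_requests=2667158/2.5mb, model_inference=0/0b, "
          "accounting=10484420/9.9mb]")
usages = parse_breaker_usages(sample)
print(usages["fielddata"])   # 1063089042
print(sum(usages.values()))  # total tracked by child breakers
```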

Then I found this --

[2022-11-29T19:51:53,924][INFO ][o.e.c.c.Coordinator      ] [node-1] master node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed [3] consecutive checks
	at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1184) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.13.0.jar:7.13.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-2][172.31.8.228:9300][internal:coordination/fault_detection/leader_check] request_id [60334983] timed out after [10007ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1185) ~[elasticsearch-7.13.0.jar:7.13.0]
	... 4 more
[2022-11-29T19:51:53,926][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}], current []}, term: 25, version: 21685, reason: becoming candidate: onLeaderFailure
[2022-11-29T19:51:53,929][INFO ][o.e.x.w.WatcherService   ] [node-1] paused watch execution, reason [no master node], cancelled [0] queued tasks
[2022-11-29T19:51:54,070][INFO ][o.e.c.s.MasterService    ] [node-1] elected-as-master ([2] nodes joined)[{node-3}{CGRUQiYQRp6wANZ3-nQflA}{7G80d_GOTtCkLQOTLeZN6Q}{172.31.1.110}{172.31.1.110:9300}{cdfhilmrstw} elect leader, {node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 26, version: 21686, delta: master node changed {previous [], current [{node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw}]}
[2022-11-29T19:52:03,943][WARN ][o.e.t.OutboundHandler    ] [node-1] send message failed [channel: Netty4TcpChannel{localAddress=/172.31.6.214:47586, remoteAddress=172.31.8.228/172.31.8.228:9300, profile=default}]
io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms

ELK stack info --

ELK Version - 7.13.0
Subscription - Platinum
ES Nodes - 3
Node config - Disk space - 1 TB, Memory - 32 GB, Cores - 8
Disk Available - 94.11% (Current) | Total Size - 2.8 TB
JVM Heap - 42.97% (Current) | Total heap 24 GB (8 GB for each node)
Indices - 189
Documents - 290,418,633
Disk Usage - 127.1 GB
Primary Shards - 189
Replica Shards - 189
Machine Learning Job - 19
All 3 Nodes are AWS servers
Kibana and logstash reside in separate AWS servers
Watcher enabled
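For reference, an 8 GB per-node heap puts the default parent breaker limit at exactly the [7.5gb] figure in the errors above, assuming the default `indices.breaker.total.limit` of 95% that applies when real-memory circuit breaking is enabled (the 7.x default). A minimal sketch of the arithmetic:

```python
heap_bytes = 8 * 1024**3               # 8 GB heap per node, as listed above
parent_limit = int(heap_bytes * 0.95)  # default indices.breaker.total.limit = 95%

print(parent_limit)  # 8160437862 -> the exact [7.5gb] limit in the log entries
```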

A few weeks back I noticed a similar incident in our ELK environment.

Can you please help me to find the root cause of the issue? Since it's a production stack, these events cause a serious impact on client monitoring.

Also, I was unable to find out why and when node-2 went down. How can I find that?

Regards,
Souvik

What is the output from the `_cluster/stats?pretty&human` API? It'll help us get a better idea of your usage.

Hi @warkolm ,

Thanks for your quick response.

Please see below as requested --

ubuntu@ip-XXX-XX-X-XXX:~$ curl -u user:password -X GET "localhost:9200/_cluster/stats?pretty"
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "name-cluster",
  "cluster_uuid" : "XXXXXXXXXXXX",
  "timestamp" : 1669873911192,
  "status" : "green",
  "indices" : {
    "count" : 189,
    "shards" : {
      "total" : 378,
      "primaries" : 189,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 290441402,
      "deleted" : 16491694
    },
    "store" : {
      "size_in_bytes" : 136364429460,
      "total_data_set_size_in_bytes" : 136364429460,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 171043440,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 816163347,
      "total_count" : 174054646,
      "hit_count" : 20037939,
      "miss_count" : 154016707,
      "cache_size" : 52108,
      "cache_count" : 80399,
      "evictions" : 28291
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3043,
      "memory_in_bytes" : 30017804,
      "terms_memory_in_bytes" : 14772664,
      "stored_fields_memory_in_bytes" : 1581560,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1397056,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 12266524,
      "index_writer_memory_in_bytes" : 160180772,
      "version_map_memory_in_bytes" : 26837642,
      "fixed_bit_set_memory_in_bytes" : 3624264,
      "max_unsafe_auto_id_timestamp" : 1669852804058,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 308,
          "index_count" : 23,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 2,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 299,
          "index_count" : 101,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 45,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "double_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 56,
          "index_count" : 21,
          "script_count" : 0
        },
        {
          "name" : "float_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 5,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 36,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 173,
          "index_count" : 18,
          "script_count" : 0
        },
        {
          "name" : "integer_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 2651,
          "index_count" : 102,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 1441,
          "index_count" : 87,
          "script_count" : 0
        },
        {
          "name" : "long_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 63,
          "index_count" : 19,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 2327,
          "index_count" : 48,
          "script_count" : 0
        },
        {
          "name" : "shape",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 1828,
          "index_count" : 80,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [
        {
          "name" : "whitespace",
          "count" : 1,
          "index_count" : 1
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.13.0",
        "index_count" : 189,
        "primary_shard_count" : 189,
        "total_primary_bytes" : 68251801341
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "data_cold" : 3,
      "data_content" : 3,
      "data_frozen" : 3,
      "data_hot" : 3,
      "data_warm" : 3,
      "ingest" : 3,
      "master" : 3,
      "ml" : 3,
      "remote_cluster_client" : 3,
      "transform" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.13.0"
    ],
    "os" : {
      "available_processors" : 24,
      "allocated_processors" : 24,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 3
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 101027368960,
        "free_in_bytes" : 12227616768,
        "used_in_bytes" : 88799752192,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 6
      },
      "open_file_descriptors" : {
        "min" : 1594,
        "max" : 2031,
        "avg" : 1770
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 81473324,
      "versions" : [
        {
          "version" : "16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "16+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 11819577384,
        "heap_max_in_bytes" : 25769803776
      },
      "threads" : 539
    },
    "fs" : {
      "total_in_bytes" : 3122490912768,
      "free_in_bytes" : 2937848168448,
      "available_in_bytes" : 2937797836800
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 3
      },
      "http_types" : {
        "security4" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 18,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}
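For what it's worth, the store figures above put the data volume on the small side: roughly 344 MiB per primary shard on average, so shard size alone seems unlikely to explain the heap pressure. The arithmetic, using the numbers from the stats output:

```python
total_primary_bytes = 68_251_801_341  # "total_primary_bytes" from the stats above
primary_shards = 189                  # "primary_shard_count" from the stats above

avg_mib = total_primary_bytes / primary_shards / 1024**2
print(round(avg_mib))  # ~344 MiB per primary shard
```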

Waiting for your response.

Regards,
Souvik

Hi @warkolm, kindly look into this.

Are all your nodes on the same network?

You really need to upgrade your version as 7.13 is EOL and not supported. Upgrade your JVM as well :slight_smile:

Hi @warkolm, yes, all the nodes are on the same network. Any clue?

Regarding the EOL: our subscription is valid until May 2023. If we upgrade to a newer version now, do we need to download the license file again for the newer version?

Can you please provide some references to upgrade it from 7.13 to 8.X?

Regards,
Souvik

If you have a subscription then you should really be chatting to Support directly about this issue :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.