CircuitBreakingException: [parent] Data too large

Souvik_Das · November 30, 2022, 1:17pm

Hi All,

Today morning, I got an error while checking kibana discover stating that its unable to fetch the result. Then I checked Stack Monitoring and found the below error --

On further investigation, I found that out of 3 elasticsearch nodes, 1 node was down and another was heavily loaded. So I restarted the node which was down and then also the issue was persisting --

After that, I restarted the other two nodes, and then the issue was gone.

Now, further checking the elasticsearch cluster log, I found several entries for CircuitBreakingException: [parent] Data too large` error which was occurring since yesterday.

Please see below some of the snippets found from the log --

[2022-11-29T12:11:34,379][ERROR][o.e.x.s.TransportSubmitAsyncSearchAction] [node-1] failed to store async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8183624210/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8180977664/7.6gb], new bytes reserved: [2646546/2.5mb], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=2667158/2.5mb, model_inference=0/0b, accounting=10484420/9.9mb]

[2022-11-29T12:11:34,584][ERROR][o.e.x.c.a.AsyncResultsService] [node-1] failed to update expiration time for async-search [FkhrOHRHdTFIUldDOHNrNVpPM1J2RHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE2NTE3Nw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/update[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/update[s]] would be [8239698386/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8239697920/7.6gb], new bytes reserved: [466/466b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=21544/21kb, model_inference=0/0b, accounting=10494204/10mb]

[2022-11-29T12:11:35,134][ERROR][o.e.x.c.a.DeleteAsyncResultsService] [node-1] failed to clean async result [FkhBVFVEMXZoVHR1MUFUMzdzMzJZVHcfMFEwX0xkUnpReS1KZEo2cjFYdHFaQTo0NjE0OTYxMw==]
org.elasticsearch.transport.RemoteTransportException: [node-2][172.31.8.228:9300][indices:data/write/bulk[s]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [8357138758/7.7gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8357138432/7.7gb], new bytes reserved: [326/326b], usages [request=592060416/564.6mb, fielddata=1063089042/1013.8mb, in_flight_requests=20938/20.4kb, model_inference=0/0b, accounting=10494204/10mb]

Then I found this --

[2022-11-29T19:51:53,924][INFO ][o.e.c.c.Coordinator      ] [node-1] master node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}{ml.machine_memory=33675792384, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=8589934592, transform.node=true}] failed [3] consecutive checks
	at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1283) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1184) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.13.0.jar:7.13.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [node-2][172.31.8.228:9300][internal:coordination/fault_detection/leader_check] request_id [60334983] timed out after [10007ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1185) ~[elasticsearch-7.13.0.jar:7.13.0]
	... 4 more
[2022-11-29T19:51:53,926][INFO ][o.e.c.s.ClusterApplierService] [node-1] master node changed {previous [{node-2}{XhSgnpKfQC-R8Ie74MU2XA}{VdwHQpY5SAK6o3cNYR-y0w}{172.31.8.228}{172.31.8.228:9300}{cdfhilmrstw}], current []}, term: 25, version: 21685, reason: becoming candidate: onLeaderFailure
[2022-11-29T19:51:53,929][INFO ][o.e.x.w.WatcherService   ] [node-1] paused watch execution, reason [no master node], cancelled [0] queued tasks
[2022-11-29T19:51:54,070][INFO ][o.e.c.s.MasterService    ] [node-1] elected-as-master ([2] nodes joined)[{node-3}{CGRUQiYQRp6wANZ3-nQflA}{7G80d_GOTtCkLQOTLeZN6Q}{172.31.1.110}{172.31.1.110:9300}{cdfhilmrstw} elect leader, {node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 26, version: 21686, delta: master node changed {previous [], current [{node-1}{0Q0_LdRzQy-JdJ6r1XtqZA}{1VSUYPTwTzCsmy_c0_XpSQ}{172.31.6.214}{172.31.6.214:9300}{cdfhilmrstw}]}
[2022-11-29T19:52:03,943][WARN ][o.e.t.OutboundHandler    ] [node-1] send message failed [channel: Netty4TcpChannel{localAddress=/172.31.6.214:47586, remoteAddress=172.31.8.228/172.31.8.228:9300, profile=default}]
io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms

ELK stack info --

ELK Version - 7.11.1
Subscription - Platinum
ES Nodes - 3
Node config - Disk space - 1 TB, Memory - 32 GB, Cores - 8
Disk Available - 94.11% (Current) | Total Size - 2.8 TB
JVM Heap - 42.97% (Current) | Total heap 24 GB (8 GB for each node)
Indices - 189
Documents - 290,418,633
Disk Usage - 127.1 GB
Primary Shards - 189
Replica Shards - 189
Machine Learning Job - 19
All 3 Nodes are AWS servers
Kibana and logstash reside in separate AWS servers
Watcher enabled

A few weeks back I noticed a similar incident in our ELK environment.

Can you please help me to find the root cause of the issue? Since it's a production stack, these events cause a serious impact on client monitoring.

Also, I was unable to find the reason why and when node 2 went down. How can I find that?

Regards,
Souvik

warkolm · December 1, 2022, 1:31am

What is the output from the _cluster/stats?pretty&human API? It'll help us get a better idea of your use.

Souvik_Das · December 1, 2022, 6:08am

Hi @warkolm ,

Thanks for your quick response.

Please see below as requested --

ubuntu@ip-XXX-XX-X-XXX:~$ curl -u user:password  -X GET "localhost:9200/_cluster/stats?pretty"
{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "name-cluster",
  "cluster_uuid" : "XXXXXXXXXXXX",
  "timestamp" : 1669873911192,
  "status" : "green",
  "indices" : {
    "count" : 189,
    "shards" : {
      "total" : 378,
      "primaries" : 189,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 290441402,
      "deleted" : 16491694
    },
    "store" : {
      "size_in_bytes" : 136364429460,
      "total_data_set_size_in_bytes" : 136364429460,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 171043440,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 816163347,
      "total_count" : 174054646,
      "hit_count" : 20037939,
      "miss_count" : 154016707,
      "cache_size" : 52108,
      "cache_count" : 80399,
      "evictions" : 28291
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3043,
      "memory_in_bytes" : 30017804,
      "terms_memory_in_bytes" : 14772664,
      "stored_fields_memory_in_bytes" : 1581560,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1397056,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 12266524,
      "index_writer_memory_in_bytes" : 160180772,
      "version_map_memory_in_bytes" : 26837642,
      "fixed_bit_set_memory_in_bytes" : 3624264,
      "max_unsafe_auto_id_timestamp" : 1669852804058,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "binary",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 308,
          "index_count" : 23,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 2,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 299,
          "index_count" : 101,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 45,
          "index_count" : 3,
          "script_count" : 0
        },
        {
          "name" : "double_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 56,
          "index_count" : 21,
          "script_count" : 0
        },
        {
          "name" : "float_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 5,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 36,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 173,
          "index_count" : 18,
          "script_count" : 0
        },
        {
          "name" : "integer_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 2651,
          "index_count" : 102,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 1441,
          "index_count" : 87,
          "script_count" : 0
        },
        {
          "name" : "long_range",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 63,
          "index_count" : 19,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 2327,
          "index_count" : 48,
          "script_count" : 0
        },
        {
          "name" : "shape",
          "count" : 1,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 1828,
          "index_count" : 80,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [
        {
          "name" : "whitespace",
          "count" : 1,
          "index_count" : 1
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.13.0",
        "index_count" : 189,
        "primary_shard_count" : 189,
        "total_primary_bytes" : 68251801341
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "data_cold" : 3,
      "data_content" : 3,
      "data_frozen" : 3,
      "data_hot" : 3,
      "data_warm" : 3,
      "ingest" : 3,
      "master" : 3,
      "ml" : 3,
      "remote_cluster_client" : 3,
      "transform" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.13.0"
    ],
    "os" : {
      "available_processors" : 24,
      "allocated_processors" : 24,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 18.04.5 LTS",
          "count" : 3
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 101027368960,
        "free_in_bytes" : 12227616768,
        "used_in_bytes" : 88799752192,
        "free_percent" : 12,
        "used_percent" : 88
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 6
      },
      "open_file_descriptors" : {
        "min" : 1594,
        "max" : 2031,
        "avg" : 1770
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 81473324,
      "versions" : [
        {
          "version" : "16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "16+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 11819577384,
        "heap_max_in_bytes" : 25769803776
      },
      "threads" : 539
    },
    "fs" : {
      "total_in_bytes" : 3122490912768,
      "free_in_bytes" : 2937848168448,
      "available_in_bytes" : 2937797836800
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 3
      },
      "http_types" : {
        "security4" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 18,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

Waiting for your response.

Regards,
Souvik

Souvik_Das · December 7, 2022, 9:44am

Hi @warkolm, Kindly look into this.

warkolm · December 7, 2022, 9:37pm

Are all your nodes on the same network?

You really need to upgrade your version as 7.13 is EOL and not supported. Upgrade your JVM as well

Souvik_Das · December 8, 2022, 8:39am

Hi @warkolm , Yes, all the nodes are on the same network. Any clue?

Regarding the EOL,
Our subscription is valid till May 2023. If we upgrade it to a newer version now, do we need to download the license file again for the newer version?

Can you please provide some references to upgrade it from 7.13 to 8.X?

Regards,
Souvik

warkolm · December 11, 2022, 8:39pm

If you have a subscription then you should really be chatting to Support directly about this issue

system · January 8, 2023, 8:39pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
CircuitBreakingException [parent] Data too large in 7.4.2 Elasticsearch	2	2348	May 19, 2020
CircuitBreakingException: [parent] Data too large Elasticsearch	5	758	October 19, 2021
org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s][r]] Elasticsearch	14	8002	August 3, 2021
CircuitBreakingException: [parent] Data too large IN ES 7.3.2 Elasticsearch	5	324	April 26, 2023
Circuit break exception - version 7.7.0 Elasticsearch	4	497	October 11, 2021

CircuitBreakingException: [parent] Data too large

Related topics