org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s][r]]

The cluster constantly throws CircuitBreakingException since the upgrade to ES 7.13.

-Xms31g
-Xmx31g

14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

We checked, and no huge queries are being executed against the indices. The cluster has 5 nodes, each with 64 GB of RAM and SSD storage.
We had no errors on ES 6.8.3.

Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [indices:data/write/bulk[s][r]] would be [32252335886/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32252318768/30gb], new bytes reserved: [17118/16.7kb], usages [request=0/0b, fielddata=9820231969/9.1gb, in_flight_requests=17118/16.7kb, model_inference=0/0b, accounting=289864524/276.4mb]

I suggest checking the JVM metrics in _nodes/stats to understand how the heap is being used. The exception says that around 30 GB of heap is already in use, which is why even a small 16.7 KB request trips the circuit breaker.
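To see where that 30 GB goes, the breaker message itself can be broken down. The sketch below is only an illustration (the helper name `parse_breaker_message` is made up here and is not an Elasticsearch API). It pulls the byte counts out of the message posted above and compares the real heap usage with the sum of the child breakers.

```python
import re

# Exception message copied from the original post (without the "Caused by" prefix).
MESSAGE = (
    "[parent] Data too large, data for [indices:data/write/bulk[s][r]] "
    "would be [32252335886/30gb], which is larger than the limit of "
    "[31621696716/29.4gb], real usage: [32252318768/30gb], "
    "new bytes reserved: [17118/16.7kb], usages [request=0/0b, "
    "fielddata=9820231969/9.1gb, in_flight_requests=17118/16.7kb, "
    "model_inference=0/0b, accounting=289864524/276.4mb]"
)


def parse_breaker_message(message):
    """Hypothetical helper: extract byte counts from a parent-breaker message."""
    limit = int(re.search(r"limit of \[(\d+)/", message).group(1))
    real = int(re.search(r"real usage: \[(\d+)/", message).group(1))
    # Everything after "usages [" is a list of name=bytes/human pairs.
    usages_text = message.split("usages [", 1)[1]
    children = {
        name: int(value)
        for name, value in re.findall(r"(\w+)=(\d+)/", usages_text)
    }
    return {"limit": limit, "real_usage": real, "children": children}


parsed = parse_breaker_message(MESSAGE)
tracked = sum(parsed["children"].values())
untracked = parsed["real_usage"] - tracked
print(f"tracked by child breakers: {tracked / 2**30:.1f} GiB")
print(f"not tracked by any child:  {untracked / 2**30:.1f} GiB")
```

With the numbers from this thread, the child breakers account for roughly 9.4 GiB (almost all of it fielddata), while about 20.6 GiB of the real usage is not attributed to any child breaker. That remainder may include garbage that has not been collected yet, so it is a hint about where to look rather than a diagnosis.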

It looks like you are using the real memory circuit breaker. You can try disabling it temporarily, which falls back to the previous memory accounting method, and see whether you still get circuit breaking exceptions.
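For reference, the real memory breaker is controlled by the static setting `indices.breaker.total.use_real_memory`, and as far as I know the default parent limit (`indices.breaker.total.limit`) is 95% of the heap when it is enabled and 70% when it is disabled. A small sketch of that arithmetic, assuming those defaults and using 33285996544 bytes as the heap (the `heap_max_in_bytes` value this cluster reports per node for -Xmx31g). The function name is only for illustration:

```python
HEAP_MAX_BYTES = 33285996544  # heap_max_in_bytes reported per node in this thread


def parent_breaker_limit(heap_bytes, use_real_memory):
    """Default parent breaker limit in bytes (assumed defaults: 95% / 70%)."""
    percent = 95 if use_real_memory else 70
    return heap_bytes * percent // 100


print(parent_breaker_limit(HEAP_MAX_BYTES, use_real_memory=True))
print(parent_breaker_limit(HEAP_MAX_BYTES, use_real_memory=False))
```

The value computed for `use_real_memory=True` is 31621696716, the same limit that appears in the exception, which is consistent with the real memory breaker being active on these nodes.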


Hello. After we restart a node, the GC works as expected for a while. At some point the heap reaches 26-27 GB and the errors start appearing. I suspect that something is preventing the GC from working correctly.
We will try disabling it temporarily. JVM stats from one node:

"jvm": {
				"timestamp": 1623655124478,
				"uptime_in_millis": 1549586,
				"mem": {
					"heap_used_in_bytes": 24071107312,
					"heap_used_percent": 72,
					"heap_committed_in_bytes": 33285996544,
					"heap_max_in_bytes": 33285996544,
					"non_heap_used_in_bytes": 233079120,
					"non_heap_committed_in_bytes": 239403008,
					"pools": {
						"young": {
							"used_in_bytes": 13555990528,
							"max_in_bytes": 0,
							"peak_used_in_bytes": 19495124992,
							"peak_max_in_bytes": 0
						},
						"old": {
							"used_in_bytes": 10119818752,
							"max_in_bytes": 33285996544,
							"peak_used_in_bytes": 10248005632,
							"peak_max_in_bytes": 33285996544
						},
						"survivor": {
							"used_in_bytes": 395298032,
							"max_in_bytes": 0,
							"peak_used_in_bytes": 1577058304,
							"peak_max_in_bytes": 0
						}
					}
				},
				"threads": {
					"count": 177,
					"peak_count": 245
				},
				"gc": {
					"collectors": {
						"young": {
							"collection_count": 190,
							"collection_time_in_millis": 10204
						},
						"old": {
							"collection_count": 0,
							"collection_time_in_millis": 0
						}
					}
				},
				"buffer_pools": {
					"mapped": {
						"count": 6905,
						"used_in_bytes": 484894438634,
						"total_capacity_in_bytes": 484894438634
					},
					"direct": {
						"count": 192,
						"used_in_bytes": 36719564,
						"total_capacity_in_bytes": 36719563
					},
					"mapped - 'non-volatile memory'": {
						"count": 0,
						"used_in_bytes": 0,
						"total_capacity_in_bytes": 0
					}
				},
				"classes": {
					"current_loaded_count": 25073,
					"total_loaded_count": 25150,
					"total_unloaded_count": 77
				}
			},
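A quick way to read the memory pools in these stats is to express each one as a share of the maximum heap. This is only a sketch with the values copied by hand from the output above; `pool_shares` is a made-up helper name.

```python
# Values copied from the "jvm.mem" section of the node stats above.
jvm_mem = {
    "heap_used_in_bytes": 24071107312,
    "heap_max_in_bytes": 33285996544,
    "pools": {
        "young": {"used_in_bytes": 13555990528},
        "old": {"used_in_bytes": 10119818752},
        "survivor": {"used_in_bytes": 395298032},
    },
}


def pool_shares(mem):
    """Return each pool's share of the maximum heap, as a percentage."""
    heap_max = mem["heap_max_in_bytes"]
    return {
        name: 100.0 * pool["used_in_bytes"] / heap_max
        for name, pool in mem["pools"].items()
    }


for name, share in pool_shares(jvm_mem).items():
    print(f"{name:>8}: {share:5.1f}% of max heap")
```

In this sample the three pools add up exactly to `heap_used_in_bytes`, the old generation holds about 30% of the heap, and no old collection has run yet. The node had only been up for about 26 minutes (`uptime_in_millis` 1549586), so this shows the state shortly after a restart rather than the state when the breaker trips.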

All nodes leave the cluster after half a day. The upgrade to ES 7.13 has been horrible.

What do your Elasticsearch logs show?
What is the output from the _cluster/stats?pretty&human API?

Hello, I don't see much in the logs. Here is the output:

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "Cluster 1 ES-WEB",
  "cluster_uuid" : "9b5IZCbFTTW8FEyShptVNw",
  "timestamp" : 1624431616018,
  "status" : "green",
  "indices" : {
    "count" : 456,
    "shards" : {
      "total" : 3455,
      "primaries" : 691,
      "replication" : 4.0,
      "index" : {
        "shards" : {
          "min" : 5,
          "max" : 30,
          "avg" : 7.576754385964913
        },
        "primaries" : {
          "min" : 1,
          "max" : 6,
          "avg" : 1.5153508771929824
        },
        "replication" : {
          "min" : 4.0,
          "max" : 4.0,
          "avg" : 4.0
        }
      }
    },
    "docs" : {
      "count" : 991951650,
      "deleted" : 397355
    },
    "store" : {
      "size" : "2.6tb",
      "size_in_bytes" : 2863072292356,
      "total_data_set_size" : "2.6tb",
      "total_data_set_size_in_bytes" : 2863072292356,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "27.7gb",
      "memory_size_in_bytes" : 29772016912,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "15.4gb",
      "memory_size_in_bytes" : 16624768238,
      "total_count" : 106865225312,
      "hit_count" : 7932106209,
      "miss_count" : 98933119103,
      "cache_size" : 2662619,
      "cache_count" : 46091220,
      "evictions" : 43428601
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 27696,
      "memory" : "1.3gb",
      "memory_in_bytes" : 1474433292,
      "terms_memory" : "1.1gb",
      "terms_memory_in_bytes" : 1203651568,
      "stored_fields_memory" : "13.9mb",
      "stored_fields_memory_in_bytes" : 14610400,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "183.2mb",
      "norms_memory_in_bytes" : 192100096,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "61.1mb",
      "doc_values_memory_in_bytes" : 64071228,
      "index_writer_memory" : "438.6mb",
      "index_writer_memory_in_bytes" : 459966552,
      "version_map_memory" : "6.7kb",
      "version_map_memory_in_bytes" : 6918,
      "fixed_bit_set" : "411mb",
      "fixed_bit_set_memory_in_bytes" : 431052992,
      "max_unsafe_auto_id_timestamp" : 1624025102136,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 177,
          "index_count" : 59,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 62,
          "index_count" : 61,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 4065,
          "index_count" : 307,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 310,
          "index_count" : 299,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 72,
          "index_count" : 72,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 42923,
          "index_count" : 456,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 118,
          "index_count" : 59,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 228,
          "index_count" : 228,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 936,
          "index_count" : 288,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 33734,
          "index_count" : 443,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [
        {
          "name" : "pattern_replace",
          "count" : 1435,
          "index_count" : 384
        }
      ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "edge_ngram",
          "count" : 379,
          "index_count" : 379
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 2571,
          "index_count" : 384
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "icu_tokenizer",
          "count" : 363,
          "index_count" : 363
        },
        {
          "name" : "keyword",
          "count" : 1736,
          "index_count" : 384
        },
        {
          "name" : "standard",
          "count" : 93,
          "index_count" : 93
        },
        {
          "name" : "whitespace",
          "count" : 379,
          "index_count" : 379
        }
      ],
      "built_in_filters" : [
        {
          "name" : "icu_folding",
          "count" : 1588,
          "index_count" : 384
        },
        {
          "name" : "lowercase",
          "count" : 983,
          "index_count" : 384
        },
        {
          "name" : "trim",
          "count" : 1046,
          "index_count" : 384
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "keyword",
          "count" : 432,
          "index_count" : 216
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.13.0",
        "index_count" : 454,
        "primary_shard_count" : 684,
        "total_primary_size" : "530.7gb",
        "total_primary_bytes" : 569901209735
      },
      {
        "version" : "7.13.1",
        "index_count" : 2,
        "primary_shard_count" : 7,
        "total_primary_size" : "2.5gb",
        "total_primary_bytes" : 2724658118
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "coordinating_only" : 0,
      "data" : 5,
      "data_cold" : 5,
      "data_content" : 5,
      "data_frozen" : 5,
      "data_hot" : 5,
      "data_warm" : 5,
      "ingest" : 5,
      "master" : 5,
      "ml" : 5,
      "remote_cluster_client" : 5,
      "transform" : 5,
      "voting_only" : 0
    },
    "versions" : [
      "7.13.1"
    ],
    "os" : {
      "available_processors" : 160,
      "allocated_processors" : 160,
      "names" : [
        {
          "name" : "Linux",
          "count" : 5
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.2 LTS",
          "count" : 5
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 5
        }
      ],
      "mem" : {
        "total" : "312.9gb",
        "total_in_bytes" : 336060563456,
        "free" : "2.2gb",
        "free_in_bytes" : 2433830912,
        "used" : "310.7gb",
        "used_in_bytes" : 333626732544,
        "free_percent" : 1,
        "used_percent" : 99
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 113
      },
      "open_file_descriptors" : {
        "min" : 2915,
        "max" : 2932,
        "avg" : 2925
      }
    },
    "jvm" : {
      "max_uptime" : "8.4d",
      "max_uptime_in_millis" : 728397727,
      "versions" : [
        {
          "version" : "16",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "16+36",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used" : "118gb",
        "heap_used_in_bytes" : 126711285488,
        "heap_max" : "155gb",
        "heap_max_in_bytes" : 166429982720
      },
      "threads" : 888
    },
    "fs" : {
      "total" : "8.5tb",
      "total_in_bytes" : 9437255413760,
      "free" : "5.9tb",
      "free_in_bytes" : 6494951006208,
      "available" : "5.4tb",
      "available_in_bytes" : 6032226689024
    },
    "plugins" : [
      {
        "name" : "analysis-icu",
        "version" : "7.13.1",
        "elasticsearch_version" : "7.13.1",
        "java_version" : "1.8",
        "description" : "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
        "classname" : "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "security4" : 5
      },
      "http_types" : {
        "security4" : 5
      }
    },
    "discovery_types" : {
      "zen" : 5
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 5
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}
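To put the cache sizes in this output into proportion, they can be compared with the total heap across the five nodes. A minimal sketch using the cluster-wide byte counts copied from the stats above; `share_of_heap` is a hypothetical helper.

```python
# Cluster-wide values copied from the _cluster/stats output above.
HEAP_MAX_BYTES = 166429982720     # nodes.jvm.mem.heap_max_in_bytes
FIELDDATA_BYTES = 29772016912     # indices.fielddata.memory_size_in_bytes
QUERY_CACHE_BYTES = 16624768238   # indices.query_cache.memory_size_in_bytes
SEGMENTS_BYTES = 1474433292       # indices.segments.memory_in_bytes


def share_of_heap(used_bytes, heap_bytes=HEAP_MAX_BYTES):
    """Percentage of the total cluster heap taken by used_bytes."""
    return 100.0 * used_bytes / heap_bytes


for label, value in [
    ("fielddata", FIELDDATA_BYTES),
    ("query cache", QUERY_CACHE_BYTES),
    ("segments", SEGMENTS_BYTES),
]:
    print(f"{label:>12}: {share_of_heap(value):5.1f}% of cluster heap")
```

Fielddata comes out at just under 18% of the total heap with zero evictions, and the query cache at about 10%. Together that is well over a quarter of the heap held by caches, so fielddata on text fields may be worth a closer look, although these numbers alone do not prove it is the cause.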

You should look at reducing your shard count; that will help with memory pressure.
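As a rough check of the shard count, the sketch below compares shards per node with the commonly cited rule of thumb of about 20 shards per GB of heap. That figure is an assumption taken from general sizing guidance, not a hard limit, and the other numbers come from the stats and JVM options posted in this thread.

```python
TOTAL_SHARDS = 3455            # indices.shards.total from _cluster/stats
DATA_NODES = 5                 # nodes.count.data
HEAP_GB = 31                   # -Xmx31g
SHARDS_PER_GB_GUIDELINE = 20   # assumption: rule of thumb, not a hard limit

shards_per_node = TOTAL_SHARDS / DATA_NODES
guideline = HEAP_GB * SHARDS_PER_GB_GUIDELINE

print(f"shards per node: {shards_per_node:.0f}")
print(f"rule-of-thumb ceiling per node: {guideline}")
print(f"over the guideline: {shards_per_node > guideline}")
```

With 3455 shards on 5 nodes this gives 691 shards per node against a guideline of 620, so the cluster is somewhat over the rule of thumb but not dramatically so.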

OK, but we might spot something you don't, so the logs are still useful to post.

Part of the ES log

[2021-06-14T08:50:49,839][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] attempting to trigger G1GC due to high heap usage [31625039864]
[2021-06-14T08:50:49,878][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] GC did bring memory usage down, before [31625039864], after [31064673272], allocations [1], duration [39]
[2021-06-14T08:50:50,676][WARN ][o.e.t.TaskCancellationService] [SRV-ESWEB12] Cannot send ban for tasks with the parent [vnIns-mKRg-_3jNiQUjWqA:322788596] for connection [org.elasticsearch.transport.TcpTransport$NodeChannels@4b4a808d]
[2021-06-14T08:50:50,691][INFO ][o.e.t.TaskCancellationService] [SRV-ESWEB12] failed to remove the parent ban for task vnIns-mKRg-_3jNiQUjWqA:322788596 for connection org.elasticsearch.transport.TcpTransport$NodeChannels@4b4a808d
[2021-06-14T08:50:50,691][WARN ][r.suppressed             ] [SRV-ESWEB12] path: /_all/_search, params: {typed_keys=true, preference=_local, index=_all}
org.elasticsearch.action.search.SearchPhaseExecutionException: 
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:661) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:89) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<reduce_aggs>] would be [31635098696/29.4gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31635098616/29.4gb], new bytes reserved: [80/80b], usages [request=157456/153.7kb, fielddata=9569907135/8.9gb, in_flight_requests=20296/19.8kb, model_inference=0/0b, accounting=287115504/273.8mb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:335) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:97) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.addEstimateAndMaybeBreak(QueryPhaseResultConsumer.java:272) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.consume(QueryPhaseResultConsumer.java:311) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:110) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:551) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:99) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:305) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:34) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:18) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:56) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:25) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:391) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$5.handleResponse(TransportService.java:732) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1273) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:291) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:128) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:84) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:693) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:63) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	... 1 more
[2021-06-14T08:50:54,961][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] attempting to trigger G1GC due to high heap usage [31651875832]
[2021-06-14T08:50:54,992][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] GC did bring memory usage down, before [31651875832], after [31336843768], allocations [1], duration [31]
[2021-06-14T08:51:00,110][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] attempting to trigger G1GC due to high heap usage [31638833656]
[2021-06-14T08:51:00,131][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [SRV-ESWEB12] GC did bring memory usage down, before [31638833656], after [31391828464], allocations [1], duration [22]
[2021-06-14T08:51:00,555][WARN ][r.suppressed             ] [SRV-ESWEB12] path: /universal_it/_search, params: {typed_keys=true, preference=_local, index=universal_it}
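The "GC did bring memory usage down" entries in this log can be quantified. The sketch below is illustrative only: it parses the before/after values from the three entries above and compares what is left after each collection with the parent breaker limit from the exception. `gc_effect` is a made-up helper name.

```python
import re

# The three "GC did bring memory usage down" entries from the log above.
LOG_LINES = [
    "GC did bring memory usage down, before [31625039864], after [31064673272], allocations [1], duration [39]",
    "GC did bring memory usage down, before [31651875832], after [31336843768], allocations [1], duration [31]",
    "GC did bring memory usage down, before [31638833656], after [31391828464], allocations [1], duration [22]",
]
PARENT_LIMIT = 31621696716  # limit reported in the CircuitBreakingException

PATTERN = re.compile(r"before \[(\d+)\], after \[(\d+)\]")


def gc_effect(line, limit=PARENT_LIMIT):
    """Return (bytes freed, usage after GC as a percentage of the limit)."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a 'GC did bring memory usage down' line")
    before, after = int(match.group(1)), int(match.group(2))
    return before - after, 100.0 * after / limit


for line in LOG_LINES:
    freed, pct = gc_effect(line)
    print(f"freed {freed / 2**20:6.0f} MiB, still at {pct:.1f}% of the limit")
```

Each of these collections frees only a few hundred MiB and leaves usage above 98% of the limit, so the next request of almost any size trips the breaker again.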

The point is that our old ES 6 server is exactly the same as this new ES 7 one, and we don't have any errors on the old server :)

What's the aggregation that is causing the breaker to trip?

We run some big aggregations, some of them nested on multiple levels.
But as I said before, ES 6 does not throw any errors.
ES 7 works for a day or two, then it starts throwing errors and dropping nodes from the cluster.
The only difference I see is the GC used by ES 7 compared to ES 6. It seems that G1 can't handle it.

Now we get this even when indexing. And as I said before, this never happened on ES 6.

circuit_breaking_exception Reason: "[parent] Data too large, data for [indices:data/write/bulk[s]] would be [31626361530/29.4gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31626353864/29.4gb], new bytes reserved: [7666/7.4kb], usages [request=0/0b, fielddata=9938397150/9.2gb, in_flight_requests=10910/10.6kb, model_inference=0/0b, accounting=296816332/283mb]"

We have lowered the shard count to approximately 500 shards per node. Same issue: after 2-3 days the heap goes up and the GC cannot release memory.

GC Log before nodes were removed from the cluster

[2021-07-06T09:47:04.439+0000][44394][safepoint     ] Safepoint "G1CollectForAllocation", Time since last: 192851174 ns, Reaching safepoint: 310376 ns, At safepoint: 15755931 ns, Total: 16066307 ns
[2021-07-06T09:47:04.452+0000][44394][gc,start      ] GC(216354) Pause Young (Normal) (G1 Evacuation Pause)
[2021-07-06T09:47:04.452+0000][44394][gc,task       ] GC(216354) Using 23 workers of 23 for evacuation
[2021-07-06T09:47:04.452+0000][44394][gc,age        ] GC(216354) Desired survivor size 109051904 bytes, new threshold 15 (max threshold 15)
[2021-07-06T09:47:04.457+0000][44394][gc,age        ] GC(216354) Age table with threshold 15 (max threshold 15)
[2021-07-06T09:47:04.457+0000][44394][gc            ] GC(216354) To-space exhausted
[2021-07-06T09:47:04.457+0000][44394][gc,phases     ] GC(216354)   Pre Evacuate Collection Set: 0.2ms
[2021-07-06T09:47:04.457+0000][44394][gc,phases     ] GC(216354)   Merge Heap Roots: 0.2ms
[2021-07-06T09:47:04.457+0000][44394][gc,phases     ] GC(216354)   Evacuate Collection Set: 1.6ms
[2021-07-06T09:47:04.457+0000][44394][gc,phases     ] GC(216354)   Post Evacuate Collection Set: 2.6ms
[2021-07-06T09:47:04.457+0000][44394][gc,phases     ] GC(216354)   Other: 0.3ms
[2021-07-06T09:47:04.457+0000][44394][gc,heap       ] GC(216354) Eden regions: 2->0(99)
[2021-07-06T09:47:04.457+0000][44394][gc,heap       ] GC(216354) Survivor regions: 0->0(0)
[2021-07-06T09:47:04.457+0000][44394][gc,heap       ] GC(216354) Old regions: 1925->1927
[2021-07-06T09:47:04.457+0000][44394][gc,heap       ] GC(216354) Archive regions: 2->2
[2021-07-06T09:47:04.457+0000][44394][gc,heap       ] GC(216354) Humongous regions: 55->55
[2021-07-06T09:47:04.457+0000][44394][gc,metaspace  ] GC(216354) Metaspace: 126733K(128192K)->126733K(128192K) NonClass: 111231K(112064K)->111231K(112064K) Class: 15501K(16128K)->15501K(16128K)
[2021-07-06T09:47:04.457+0000][44394][gc            ] GC(216354) Pause Young (Normal) (G1 Evacuation Pause) 31529M->31529M(31744M) 4.919ms
[2021-07-06T09:47:04.457+0000][44394][gc,cpu        ] GC(216354) User=0.05s Sys=0.00s Real=0.01s
[2021-07-06T09:47:04.457+0000][44394][gc,ergo       ] Attempting maximally compacting collection
[2021-07-06T09:47:04.457+0000][44394][gc,task       ] GC(216355) Using 23 workers of 23 for full compaction
[2021-07-06T09:47:04.478+0000][44394][gc,start      ] GC(216355) Pause Full (G1 Evacuation Pause)
[2021-07-06T09:47:04.488+0000][44394][gc,phases,start] GC(216355) Phase 1: Mark live objects
[2021-07-06T09:47:04.696+0000][44394][gc,phases      ] GC(216355) Phase 1: Mark live objects 207.980ms
[2021-07-06T09:47:04.697+0000][44394][gc,phases,start] GC(216355) Phase 2: Prepare for compaction
[2021-07-06T09:47:04.766+0000][44394][gc,phases      ] GC(216355) Phase 2: Prepare for compaction 69.802ms
[2021-07-06T09:47:04.766+0000][44394][gc,phases,start] GC(216355) Phase 3: Adjust pointers
[2021-07-06T09:47:04.872+0000][44394][gc,phases      ] GC(216355) Phase 3: Adjust pointers 105.839ms
[2021-07-06T09:47:04.872+0000][44394][gc,phases,start] GC(216355) Phase 4: Compact heap
[2021-07-06T09:47:05.340+0000][44394][gc,phases      ] GC(216355) Phase 4: Compact heap 468.134ms
[2021-07-06T09:47:05.390+0000][44394][gc,heap        ] GC(216355) Eden regions: 0->0(765)
[2021-07-06T09:47:05.390+0000][44394][gc,heap        ] GC(216355) Survivor regions: 0->0(0)
[2021-07-06T09:47:05.390+0000][44394][gc,heap        ] GC(216355) Old regions: 1927->645
[2021-07-06T09:47:05.390+0000][44394][gc,heap        ] GC(216355) Archive regions: 2->2
[2021-07-06T09:47:05.390+0000][44394][gc,heap        ] GC(216355) Humongous regions: 55->36
[2021-07-06T09:47:05.390+0000][44394][gc,metaspace   ] GC(216355) Metaspace: 126733K(128192K)->126729K(128192K) NonClass: 111231K(112064K)->111228K(112064K) Class: 15501K(16128K)->15501K(16128K)
[2021-07-06T09:47:05.390+0000][44394][gc             ] GC(216355) Pause Full (G1 Evacuation Pause) 31529M->10704M(31744M) 912.046ms
[2021-07-06T09:47:05.390+0000][44394][gc,cpu         ] GC(216355) User=18.14s Sys=0.09s Real=0.93s
[2021-07-06T09:47:05.390+0000][44394][safepoint      ] Safepoint "G1CollectForAllocation", Time since last: 12634858 ns, Reaching safepoint: 1045214 ns, At safepoint: 938141774 ns, Total: 939186988 ns
[2021-07-06T09:47:05.391+0000][44394][gc,marking     ] GC(216352) Concurrent Mark From Roots 1160.640ms
[2021-07-06T09:47:05.391+0000][44394][gc,marking     ] GC(216352) Concurrent Mark Abort
[2021-07-06T09:47:05.391+0000][44394][gc             ] GC(216352) Concurrent Mark Cycle 1160.850ms
[2021-07-06T09:47:05.428+0000][44394][safepoint      ] Safepoint "ICBufferFull", Time since last: 37694421 ns, Reaching safepoint: 265076 ns, At safepoint: 14030 ns, Total: 279106 ns
[2021-07-06T09:47:05.538+0000][44394][safepoint      ] Safepoint "ICBufferFull", Time since last: 109740347 ns, Reaching safepoint: 294378 ns, At safepoint: 25691 ns, Total: 320069 ns
[2021-07-06T09:47:05.564+0000][44394][safepoint      ] Safepoint "ICBufferFull", Time since last: 25345376 ns, Reaching safepoint: 337262 ns, At safepoint: 28263 ns, Total: 365525 ns
[2021-07-06T09:47:05.988+0000][44394][gc,heap,exit   ] Heap
[2021-07-06T09:47:05.988+0000][44394][gc,heap,exit   ]  garbage-first heap   total 32505856K, used 11518881K [0x0000001001000000, 0x00000017c1000000)
[2021-07-06T09:47:05.988+0000][44394][gc,heap,exit   ]   region size 16384K, 34 young (557056K), 0 survivors (0K)
[2021-07-06T09:47:05.988+0000][44394][gc,heap,exit   ]  Metaspace       used 126795K, committed 128192K, reserved 1163264K
[2021-07-06T09:47:05.988+0000][44394][gc,heap,exit   ]   class space    used 15517K, committed 16128K, reserved 1048576K
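The two pause summary lines in this GC log tell most of the story. The sketch below is a rough illustration that extracts the before/after heap sizes from them; the regular expression is written for the line format shown above and may not fit other JVM versions or logging configurations. `parse_pause` is a hypothetical helper.

```python
import re

# The two pause summary lines from the GC log above.
GC_LINES = [
    "GC(216354) Pause Young (Normal) (G1 Evacuation Pause) 31529M->31529M(31744M) 4.919ms",
    "GC(216355) Pause Full (G1 Evacuation Pause) 31529M->10704M(31744M) 912.046ms",
]

PAUSE = re.compile(r"Pause (Young|Full).*?(\d+)M->(\d+)M\((\d+)M\) ([\d.]+)ms")


def parse_pause(line):
    """Return (kind, MB freed, heap capacity in MB, pause in ms) or None."""
    match = PAUSE.search(line)
    if match is None:
        return None
    kind = match.group(1)
    before, after, capacity = (int(match.group(i)) for i in (2, 3, 4))
    return kind, before - after, capacity, float(match.group(5))


for line in GC_LINES:
    kind, freed, capacity, millis = parse_pause(line)
    print(f"{kind:>5}: freed {freed}M of {capacity}M in {millis} ms")
```

The young collection frees nothing (it ends with "To-space exhausted"), while the full collection that follows frees 20825M, roughly two thirds of the heap, in under a second. That suggests most of the heap was collectable garbage rather than live data, although this one sample cannot explain why it was not reclaimed earlier.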
