I am debugging an ES 7.17.3 installation that is persistently running out of memory and tripping the parent circuit breaker. This is an example error:
elasticsearch.exceptions.TransportError: TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [8432214884/7.8gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8432214672/7.8gb], new bytes reserved: [212/212b], usages [request=16440/16kb, fielddata=15261674/14.5mb, in_flight_requests=212/212b, model_inference=0/0b, eql_sequence=0/0b, accounting=61581360/58.7mb]')
I've more than doubled the JVM heap, from 3 GB to 8 GB, but the memory issues are unchanged. The other surprising thing is that these errors occur during periods of heavy indexing load, yet they always seem to be triggered by a search call to ES, never an indexing call.
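For context, the failing requests come through the elasticsearch-py 7.x client. This is only a simplified stand-in for the search pattern that hits the error; the host, index name, and query below are placeholders, not my real ones:

```python
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

# Placeholder connection details.
es = Elasticsearch(["http://elasticsearch-master:9200"])

try:
    # Placeholder index and query; the real calls are ordinary match/bool searches.
    resp = es.search(
        index="my-index",
        body={"query": {"match": {"message": "example"}}},
        size=100,
    )
except TransportError as exc:
    # 429 with 'circuit_breaking_exception' is what keeps coming back.
    if exc.status_code == 429:
        print("breaker tripped:", exc.error, exc.info)
    else:
        raise
```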
For debugging, here is the output from the _cat/nodes endpoint:
name id node.role heap.current heap.percent heap.max
elasticsearch-master-0 uLpr cdfhilmrstw 5.6gb 70 8gb
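That snapshot was pulled with the cat nodes API, roughly like this (same placeholder host as above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-master:9200"])

# Same columns as the table above (elasticsearch-py 7.x cat API).
print(es.cat.nodes(h="name,id,node.role,heap.current,heap.percent,heap.max", v=True))
```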
And the node breaker stats:
{
  "uLprTtGlRWq-L2mi-mNeFg": {
    "timestamp": 1709934029351,
    "name": "elasticsearch-master-0",
    "transport_address": "10.1.1.148:9300",
    "host": "10.1.1.148",
    "ip": "10.1.1.148:9300",
    "roles": [
      "data",
      "data_cold",
      "data_content",
      "data_frozen",
      "data_hot",
      "data_warm",
      "ingest",
      "master",
      "ml",
      "remote_cluster_client",
      "transform"
    ],
    "attributes": {
      "ml.machine_memory": "17179869184",
      "xpack.installed": "true",
      "transform.node": "true",
      "ml.max_open_jobs": "512",
      "ml.max_jvm_size": "8589934592"
    },
    "breakers": {
      "request": {
        "limit_size_in_bytes": 5153960755,
        "limit_size": "4.7gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
      },
      "fielddata": {
        "limit_size_in_bytes": 3435973836,
        "limit_size": "3.1gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1.03,
        "tripped": 0
      },
      "in_flight_requests": {
        "limit_size_in_bytes": 8589934592,
        "limit_size": "8gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 2,
        "tripped": 0
      },
      "model_inference": {
        "limit_size_in_bytes": 4294967296,
        "limit_size": "4gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
      },
      "eql_sequence": {
        "limit_size_in_bytes": 4294967296,
        "limit_size": "4gb",
        "estimated_size_in_bytes": 0,
        "estimated_size": "0b",
        "overhead": 1,
        "tripped": 0
      },
      "accounting": {
        "limit_size_in_bytes": 8589934592,
        "limit_size": "8gb",
        "estimated_size_in_bytes": 54473076,
        "estimated_size": "51.9mb",
        "overhead": 1,
        "tripped": 0
      },
      "parent": {
        "limit_size_in_bytes": 8160437862,
        "limit_size": "7.5gb",
        "estimated_size_in_bytes": 5529302008,
        "estimated_size": "5.1gb",
        "overhead": 1,
        "tripped": 3186
      }
    }
  }
}
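The breaker stats were pulled from the node stats API, limited to the breaker metric, along these lines:

```python
import json

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-master:9200"])

# Restrict node stats to the circuit breaker section and print the per-node entries.
stats = es.nodes.stats(metric="breaker")
print(json.dumps(stats["nodes"], indent=2))
```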
I'm trying to piece together what is happening and how to resolve it. Is the garbage collector struggling to keep up? Is there a remediation besides adding still more memory?
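In case it is relevant, this is how I have been sampling GC activity while the errors occur, using the jvm section of the same node stats API; I am assuming old-generation collection counts and times are the right signal to watch:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elasticsearch-master:9200"])

# Heap usage plus old-generation GC counters for each node.
jvm_stats = es.nodes.stats(metric="jvm")
for node_id, node in jvm_stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(
        node["name"],
        f"heap={heap_pct}%",
        f"old_gc_count={old_gc['collection_count']}",
        f"old_gc_time_ms={old_gc['collection_time_in_millis']}",
    )
```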