Elasticsearch cluster down due to high memory usage

I ran some queries on ES that fetched a huge amount of data; because of that, memory utilization went very high and the ES cluster went down.
Below is the error that the ES Java client threw:

{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [4031859746/3.7gb], which is larger than the limit of [3865051136/3.5gb], real usage: [4031859168/3.7gb], new bytes reserved: [578/578b], usages [model_inference=0/0b, inflight_requests=47776/46.6kb, request=1290403896/1.2gb, fielddata=340509/332.5kb, eql_sequence=0/0b]","bytes_wanted":4031859746,"bytes_limit":3865051136,"durability":"TRANSIENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [4031859746/3.7gb], which is larger than the limit of [3865051136/3.5gb], real usage: [4031859168/3.7gb], new bytes reserved: [578/578b], usages [model_inference=0/0b, inflight_requests=47776/46.6kb, request=1290403896/1.2gb, fielddata=340509/332.5kb, eql_sequence=0/0b]","bytes_wanted":4031859746,"bytes_limit":3865051136,"durability":"TRANSIENT"},"status":429}

On the ES node I found the logs below, where GC was not able to bring memory usage down:

[2023-07-06T07:44:51,220][WARN ][o.e.m.j.JvmGcMonitorService] [es-1.com] [gc][9226168] overhead, spent [563ms] collecting in the last [1s]
[2023-07-06T07:50:43,768][WARN ][o.e.m.j.JvmGcMonitorService] [es-1.com] [gc][9226520] overhead, spent [852ms] collecting in the last [1.1s]
[2023-07-06T07:50:44,368][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [es-1.com] attempting to trigger G1GC due to high heap usage [3870694016]
[2023-07-06T07:50:44,375][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [es-1.com] GC did not bring memory usage down, before [3870694016], after [3885846280], allocations [1], duration [7]
[2023-07-06T07:50:45,153][WARN ][o.e.m.j.JvmGcMonitorService] [es-1.com] [gc][9226521] overhead, spent [1.2s] collecting in the last [1.4s]
[2023-07-06T07:50:46,291][WARN ][o.e.m.j.JvmGcMonitorService] [es-1.com] [gc][9226522] overhead, spent [1.1s] collecting in the last [1.1s]
[2023-07-06T07:51:21,814][WARN ][o.e.t.ThreadPool         ] [es-1.com] timer thread slept for [30.2s/30283ms] on absolute clock which is above the warn threshold of [5000ms]

There are 3 VM nodes in the ES cluster and each node has all roles. My ES cluster's node roles look like this:

cdfhilmrstw
cdfhilmrstw
cdfhilmrstw

I am not able to understand why the ES VM went down because of this. Is there any configuration through which, instead of shutting down, the VM gets restarted in such scenarios, so that it doesn't impact production traffic?

Hi @maulik_trapasiya,

It looks like the circuit_breaking_exception and the GC overhead warnings in your logs answer your question as to why ES went down: the heap was effectively exhausted by a very large request. We would recommend splitting larger requests into smaller ones to prevent this from happening.
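As an illustration of that idea, here is a minimal sketch assuming a recent 8.x elasticsearch-java client (co.elastic.clients); the index name my-index, the sort field @timestamp, and the page size are placeholders you would adapt. It walks the result set in fixed-size pages with search_after instead of pulling everything back in a single request:

import java.io.IOException;
import java.util.List;

import com.fasterxml.jackson.databind.node.ObjectNode;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.FieldValue;
import co.elastic.clients.elasticsearch._types.SortOrder;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.elasticsearch.core.search.Hit;

public class PagedExport {

    // Placeholder names -- adjust to your index and to a sort field that every document has.
    private static final String INDEX = "my-index";
    private static final String SORT_FIELD = "@timestamp";
    private static final int PAGE_SIZE = 1000; // small pages keep each response far below the breaker limit

    public static void exportAll(ElasticsearchClient client) throws IOException {
        List<FieldValue> searchAfter = null;
        while (true) {
            final List<FieldValue> after = searchAfter;
            SearchResponse<ObjectNode> response = client.search(s -> {
                s.index(INDEX)
                 .size(PAGE_SIZE)
                 .sort(so -> so.field(f -> f.field(SORT_FIELD).order(SortOrder.Asc)));
                if (after != null) {
                    s.searchAfter(after); // resume from where the previous page ended
                }
                return s;
            }, ObjectNode.class);

            List<Hit<ObjectNode>> hits = response.hits().hits();
            if (hits.isEmpty()) {
                break; // no more results
            }
            for (Hit<ObjectNode> hit : hits) {
                // process hit.source() here instead of accumulating all pages in memory
            }
            // the sort values of the last hit become the cursor for the next page
            searchAfter = hits.get(hits.size() - 1).sort();
        }
    }
}

For a fully consistent view while paging you would normally open a point-in-time and pass its id along with search_after; on much older clients the scroll API is the equivalent pattern.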

In terms of having Elasticsearch auto-restart, it depends on how you are running it. Am I correct in assuming you are running Elasticsearch in a VM?
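For reference, if it turns out to be a self-managed DEB/RPM install on a Linux VM (an assumption on my part, not something from your post), one common approach is a systemd drop-in so the elasticsearch service is restarted automatically if the JVM dies:

# created via: sudo systemctl edit elasticsearch
# lands in /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=30

After sudo systemctl daemon-reload this only limits the downtime, though; the oversized requests that trip the circuit breaker still need to be addressed.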
