Elastic nodes large heap usage - "Data too large, data for [<http_request>] would be..."

jenga · August 16, 2019, 3:57am

Hi,

I have a 2-node Elastic cluster, each node has 6GB RAM and 2CPUs. I've assigned a max heap of 2GB to ElasticSearch. There's a tiny amount of data in there as I'm still trying to get it set up properly.

At least once a day 1 of the indexes gets marked yellow and I can see that at least 1 shard isn't being assigned because of a triggered circuit breaker. I'll just post part of the output of /_cluster/allocation/explain?pretty as it seems to have all the info:

  "index" : "logstash-2019.08.16",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-08-16T00:11:19.260Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [1S5LcKcjT2mCPSo31Onlfg]: failed recovery, failure RecoveryFailedException[[logstash-2019.08.16][0]: Recovery failed from {infra-elk01-inst-a}{BBu_2gYXQn2KXe8izUYs1Q}{OJMhGvE-ThCtBpqZ0FVGjQ}{<node01>}{<node01>:9300}{dim}{ml.machine_memory=6085251072, ml.max_open_jobs=20, xpack.installed=true} into {infra-elk02-inst-a}{1S5LcKcjT2mCPSo31Onlfg}{bpmtxcHPS96u0HxqU4qiSw}{<node02>}{<node02>:9300}{dim}{ml.machine_memory=6085246976, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[infra-elk01-inst-a][<:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[infra-elk02-inst-a][<node02>:9300][internal:index/shard/recovery/prepare_translog]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2101155290/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2101154736/1.9gb], new bytes reserved: [554/554b], usages [request=0/0b, fielddata=1928/1.8kb, in_flight_requests=554/554b, accounting=411900/402.2kb]]; ",
    "last_allocation_status" : "no_attempt"

I've also noticed the heap usage seems to climb very high until it also triggers a circuit breaker just by checking the node status.

root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip>           87          95  13    0.16    0.23     0.18 dim       *      infra-elk01-inst-a
<node02-ip>           13          92   3    0.14    0.15     0.20 dim       -      infra-elk02-inst-a
root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"},"status":429}

If I force it to re-assign with [/_cluster/reroute?retry_failed=true] it all goes green again.

There's only a very tiny amount of data on this cluster (currently just 5 indices and they're only a few mb in size each) so I don't understand how it can use this much heap.

I can add more heap if that will resolve the issue but I'm concerned the same thing will keep happening.

Here's the typical heap usage when it's running normally:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip>           17          95   5    0.27    0.15     0.14 dim       *      infra-elk01-inst-a
<node02-ip>           14          92   5    0.62    0.20     0.16 dim       -      infra-elk02-inst-a

Any ideas?

Edit: I managed to get monitoring of the cluster working and noticed this (on elk01 only):
15%20pm

Thanks,
Jeremy

jenga · August 21, 2019, 6:13am

Just wanted to add, in case anyone has the same issue:

I had some un-expanded template variables in my JVM options for the "gclog" setting. I fixed that up and enabled GC logging and restarted both nodes and I haven't had any issues since. Still seems like very strange behaviour whether or not it was caused by a bad config file.

HenningAndersen · August 21, 2019, 6:34am

Hi @jenga,

in order to get closer at how this could happen, it will be necessary to get information on:

the ES version
the java version
which GC you are using
whether the mangled gclog settings could have made java revert to the default GC (could try to unfix it and check).

jenga · August 21, 2019, 6:41am

Hi Henning,

7.3.0
I appear to be using a JVM under /usr/share/elasticsearch/jdk - openjdk 12.0.1 2019-04-16
The default for 7.3.0 - I copied the JVM options from the Elastic Github repo, as the Puppet module I'm using had outdated settings.
That could certainly be it, although since I have it working now I'd rather not mess with it. If I get a chance to test I'll post back here.

Thanks,
Jeremy

HenningAndersen · August 21, 2019, 6:50am

Hi @jenga,

thanks for the info. A non default GC (CMS is the ES default) could impact the circuit breaker, would be interesting to know which GC was in use when this happened. Hope you can find the time/slot to try it out.

system · September 18, 2019, 6:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch heap issues Elasticsearch	4	438	July 5, 2017
High heap usage Elasticsearch	6	981	March 8, 2019
Elastic Heap Size issue Elasticsearch	13	703	June 27, 2018
Need help with configuring memory usage Elasticsearch	4	727	July 5, 2017
Data too large, data for [<transport_request>] would be [10554893106/9.8gb], which is larger than the limit of [10092838912/9.3gb], real usage: [10523239224/9.8gb], new bytes reserved: [31653882/30.1mb]]]] Elasticsearch	17	16368	September 2, 2019

Elastic nodes large heap usage - "Data too large, data for [<http_request>] would be..."

Related topics