Elastic nodes large heap usage - "Data too large, data for [<http_request>] would be..."

Hi,

I have a 2-node Elastic cluster, each node has 6GB RAM and 2CPUs. I've assigned a max heap of 2GB to ElasticSearch. There's a tiny amount of data in there as I'm still trying to get it set up properly.

At least once a day 1 of the indexes gets marked yellow and I can see that at least 1 shard isn't being assigned because of a triggered circuit breaker. I'll just post part of the output of /_cluster/allocation/explain?pretty as it seems to have all the info:

  "index" : "logstash-2019.08.16",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-08-16T00:11:19.260Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [1S5LcKcjT2mCPSo31Onlfg]: failed recovery, failure RecoveryFailedException[[logstash-2019.08.16][0]: Recovery failed from {infra-elk01-inst-a}{BBu_2gYXQn2KXe8izUYs1Q}{OJMhGvE-ThCtBpqZ0FVGjQ}{<node01>}{<node01>:9300}{dim}{ml.machine_memory=6085251072, ml.max_open_jobs=20, xpack.installed=true} into {infra-elk02-inst-a}{1S5LcKcjT2mCPSo31Onlfg}{bpmtxcHPS96u0HxqU4qiSw}{<node02>}{<node02>:9300}{dim}{ml.machine_memory=6085246976, xpack.installed=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[infra-elk01-inst-a][<:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[infra-elk02-inst-a][<node02>:9300][internal:index/shard/recovery/prepare_translog]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [2101155290/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2101154736/1.9gb], new bytes reserved: [554/554b], usages [request=0/0b, fielddata=1928/1.8kb, in_flight_requests=554/554b, accounting=411900/402.2kb]]; ",
    "last_allocation_status" : "no_attempt"

I've also noticed the heap usage seems to climb very high until it also triggers a circuit breaker just by checking the node status.

root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip>           87          95  13    0.16    0.23     0.18 dim       *      infra-elk01-inst-a
<node02-ip>           13          92   3    0.14    0.15     0.20 dim       -      infra-elk02-inst-a
root@infra-elk01.sac1:~# curl -u jparker -s 'http://localhost:9200/_cat/nodes?v'
Enter host password for user 'jparker':
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [2092117976/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2092117976/1.9gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=1804/1.7kb, in_flight_requests=0/0b, accounting=421354/411.4kb]","bytes_wanted":2092117976,"bytes_limit":2040109465,"durability":"PERMANENT"},"status":429}

If I force it to re-assign with [/_cluster/reroute?retry_failed=true] it all goes green again.

There's only a very tiny amount of data on this cluster (currently just 5 indices and they're only a few mb in size each) so I don't understand how it can use this much heap.

I can add more heap if that will resolve the issue but I'm concerned the same thing will keep happening.

Here's the typical heap usage when it's running normally:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
<node01-ip>           17          95   5    0.27    0.15     0.14 dim       *      infra-elk01-inst-a
<node02-ip>           14          92   5    0.62    0.20     0.16 dim       -      infra-elk02-inst-a

Any ideas?

Edit: I managed to get monitoring of the cluster working and noticed this (on elk01 only):
15%20pm

Thanks,
Jeremy

1 Like

Just wanted to add, in case anyone has the same issue:

I had some un-expanded template variables in my JVM options for the "gclog" setting. I fixed that up and enabled GC logging and restarted both nodes and I haven't had any issues since. Still seems like very strange behaviour whether or not it was caused by a bad config file.

Hi @jenga,

in order to get closer at how this could happen, it will be necessary to get information on:

  1. the ES version
  2. the java version
  3. which GC you are using
  4. whether the mangled gclog settings could have made java revert to the default GC (could try to unfix it and check).

Hi Henning,

  1. 7.3.0
  2. I appear to be using a JVM under /usr/share/elasticsearch/jdk - openjdk 12.0.1 2019-04-16
  3. The default for 7.3.0 - I copied the JVM options from the Elastic Github repo, as the Puppet module I'm using had outdated settings.
  4. That could certainly be it, although since I have it working now I'd rather not mess with it. If I get a chance to test I'll post back here.

Thanks,
Jeremy

Hi @jenga,

thanks for the info. A non default GC (CMS is the ES default) could impact the circuit breaker, would be interesting to know which GC was in use when this happened. Hope you can find the time/slot to try it out.