Data too large circuit breaking exception after migrating to 7.12

Hello,

We have just migrated our cluster from 6.8 to 7.12 and have noticed recurring errors during intensive parallel indexing.
Here is the kind of error we get:

[parent] Data too large, data for [<http_request>] would be [16400370248/15.2gb], which is larger than the limit of [16320875724/15.1gb], real usage: [16400370248/15.2gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=8667837/8.2mb, in_flight_requests=0/0b, model_inference=0/0b, accounting=474802236/452.8mb]
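To make the numbers in that message easier to compare, here is a small sketch that parses it and sets the child breaker usages against the real usage. The function name `parse_breaker_message` is my own, and the regular expressions assume the message layout shown above.

```python
import re

# The error message quoted above, copied verbatim.
MESSAGE = (
    "[parent] Data too large, data for [<http_request>] would be "
    "[16400370248/15.2gb], which is larger than the limit of "
    "[16320875724/15.1gb], real usage: [16400370248/15.2gb], "
    "new bytes reserved: [0/0b], usages [request=0/0b, "
    "fielddata=8667837/8.2mb, in_flight_requests=0/0b, "
    "model_inference=0/0b, accounting=474802236/452.8mb]"
)


def parse_breaker_message(message):
    """Return (limit_bytes, real_usage_bytes, child_usages) from the message."""
    limit = int(re.search(r"limit of \[(\d+)/", message).group(1))
    real = int(re.search(r"real usage: \[(\d+)/", message).group(1))
    usages_part = re.search(r"usages \[(.*)\]", message).group(1)
    # Each child breaker appears as name=<bytes>/<human readable size>.
    children = {
        name: int(value)
        for name, value in re.findall(r"(\w+)=(\d+)/", usages_part)
    }
    return limit, real, children


limit, real, children = parse_breaker_message(MESSAGE)
tracked = sum(children.values())

print("over the limit by (bytes):", real - limit)
print("tracked by child breakers (bytes):", tracked)
print("heap not tracked by any child breaker (bytes):", real - tracked)
```

The child breakers account for roughly 461mb in total, so almost all of the 15.2gb of real usage is heap that no child breaker tracks.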

From what I have seen on this forum, the fact that this error did not appear before is caused by one or both of the following: ES now uses real memory usage to calculate the parent circuit breaker, and the JVM shipped with ES uses G1GC instead of CMS. (We are using the shipped JVM, and our jvm.options are the defaults, except for the heap size, which is set to half of our servers' memory.)

In some topics on this forum, I have seen that using CMS instead of G1GC solved the issue for some users, but this doesn't seem possible with 7.12 as the JVM shipped with it doesn't support this any more. I think we could use a different JVM from the one shipped with ES but we would prefer to avoid that.
Another solution I have seen, though I think ES team members advise against it, is setting indices.breaker.total.use_real_memory to false. This is the workaround we would fall back on, but the setting seems really useful for preventing out-of-memory issues, so if possible we would rather not touch it.

Instead, I'm trying to work out what could make memory grow this much during our intensive process, but so far I haven't reached any definitive conclusion.

I don't know if it's expected, but calling _nodes/stats/breaker at any time shows that every breaker is very low on usage, except for the parent breaker, which was above 10 GB out of a 15.1 GB limit every time I checked. I know that with the use_real_memory option it's not supposed to represent the sum of the child breakers, but is there a way to know what's taking so much memory? I don't understand why it's so full...
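For reference, this is roughly how I compare the parent breaker with its children. It is only a sketch: `summarize_breakers` is my own helper, it works on the already-decoded JSON of a _nodes/stats/breaker response, and the sample payload below is made up to mimic the shape of what we see (it is not a copy of our real output).

```python
def summarize_breakers(stats):
    """For each node, return (name, parent_bytes, children_bytes, gap_bytes)."""
    rows = []
    for node_id, node in stats["nodes"].items():
        breakers = node["breakers"]
        parent = breakers["parent"]["estimated_size_in_bytes"]
        # Sum every breaker except the parent itself.
        children = sum(
            breaker["estimated_size_in_bytes"]
            for name, breaker in breakers.items()
            if name != "parent"
        )
        rows.append((node.get("name", node_id), parent, children, parent - children))
    return rows


# Hypothetical payload with illustrative values only.
sample_stats = {
    "nodes": {
        "node-id-1": {
            "name": "data-1",
            "breakers": {
                "request": {"estimated_size_in_bytes": 0},
                "fielddata": {"estimated_size_in_bytes": 8_000_000},
                "in_flight_requests": {"estimated_size_in_bytes": 0},
                "accounting": {"estimated_size_in_bytes": 450_000_000},
                "parent": {"estimated_size_in_bytes": 11_000_000_000},
            },
        }
    }
}

rows = summarize_breakers(sample_stats)
for name, parent, children, gap in rows:
    print(f"{name}: parent={parent} children={children} untracked={gap}")
```

With real memory enabled, the "untracked" column is simply heap in use that no child breaker accounts for, which is the part I cannot explain.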

For context, if needed: most of the time when we saw this error, it happened while we were monitoring a few big _update_by_query tasks launched with a Painless script (which accesses nested elements and adds a nested element to the doc, if that matters). Those tasks are launched with a query on at most 5000 parent ids (has_parent => ids), and the script takes a parameter with the same number of values (for each parent id, an integer value that we use in the script).
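To give an idea of the request shape, here is a sketch of how such an _update_by_query body could be built. The parent type name, the parameter name `values`, and the script source are placeholders, not our real ones; only the has_parent => ids structure and the one-integer-per-parent-id parameter match what I described.

```python
import json


def build_update_body(values_by_parent_id, parent_type, script_source):
    """Build an _update_by_query body targeting children of the given parents."""
    return {
        "query": {
            "has_parent": {
                "parent_type": parent_type,
                "query": {"ids": {"values": list(values_by_parent_id)}},
            }
        },
        "script": {
            "lang": "painless",
            "source": script_source,
            # One integer per parent id, looked up from inside the script.
            "params": {"values": values_by_parent_id},
        },
    }


# Placeholder data: 5000 parent ids, each mapped to an integer.
values_by_parent_id = {f"parent-{i}": i for i in range(5000)}
body = build_update_body(
    values_by_parent_id,
    parent_type="my_parent_type",          # placeholder name
    script_source="/* real script omitted */",  # placeholder source
)

payload_bytes = len(json.dumps(body).encode("utf-8"))
print("parent ids:", len(body["query"]["has_parent"]["query"]["ids"]["values"]))
print("serialized request size (bytes):", payload_bytes)
```

The request body itself stays small compared with the heap, which is part of why I don't understand what fills the memory while these tasks run.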

Thanks for any help provided.

Hello,
We tried optimizing some of our scripts and setting use_real_memory to false. While our requests were no longer killed, two nodes failed during the night (as we more or less expected).
There is probably something we can do to optimize our requests, but this was not happening with ES 6.7, so there is obviously something using all the heap that was not there before. Is this G1GC or something else?
Manually calling the _cache/clear API for all indices does reduce the heap by around 20 GB (across the whole cluster). So I'm guessing ES caches more in this version than it did previously. Can we prevent this? We already have a cache at the application level for big queries with aggregations, so is there any ES caching we could disable without putting the cluster at risk?
Thanks.

We tried changing the cache sizes to see if the heap usage would improve and noticed something strange.
We have set this on each node:

indices.fielddata.cache.size: 10%
indices.queries.cache.size: 5%

We also disabled the request cache on our most used indices, but according to the docs this cache is limited to 1% of the heap by default, so it was probably not the issue.

A day after this modification (so after restarting all nodes to apply these static settings), we noticed the heap was going up again. On our master node, it went to 99%.
To prevent a crash, I manually called the clear cache API (to clear all of it, with POST /_cache/clear) and then saw the heap usage on this node drop from 99% to around 60%.

I don't understand how that's possible. From what I understand of the clear cache API, it can only clear the fielddata, query and request caches. But with our new settings, these caches combined should represent at most 15% of the heap (or 16% with the request cache). So how can clearing at most 15% of the heap release around 40% of it on a node?
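Here is the arithmetic spelled out as a small sketch. It rests on two assumptions of mine: that the 15.1gb parent limit from the original error is the default 95% of the heap (which would mean a 16 GiB heap), and that the master node has the same heap size as the node that logged that error.

```python
GIB = 1024 ** 3

# Parent breaker limit taken from the error message in the first post.
PARENT_LIMIT_BYTES = 16_320_875_724
# Assumption: the parent limit was at its default of 95% of the heap.
ASSUMED_PARENT_LIMIT_PERCENT = 95

heap_bytes = PARENT_LIMIT_BYTES * 100 // ASSUMED_PARENT_LIMIT_PERCENT

# Cache limits as configured (request cache at its documented 1% default).
cache_limit_percent = {"fielddata": 10, "queries": 5, "request": 1}
budget_percent = sum(cache_limit_percent.values())

# Heap usage observed on the master node before and after POST /_cache/clear.
observed_drop_percent = 99 - 60
unexplained_percent = observed_drop_percent - budget_percent


def to_gib(percent):
    """Convert a percentage of the heap into GiB."""
    return heap_bytes * percent / 100 / GIB


print(f"estimated heap: {heap_bytes / GIB:.2f} GiB")
print(f"max cache budget: {budget_percent}% = {to_gib(budget_percent):.2f} GiB")
print(f"observed drop: {observed_drop_percent}% = {to_gib(observed_drop_percent):.2f} GiB")
print(f"not explained by cache limits: {unexplained_percent}% = {to_gib(unexplained_percent):.2f} GiB")
```

Under those assumptions, the three caches could hold roughly 2.6 GiB at most, while the clear released roughly 6.2 GiB.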

The only two explanations I see are:

  • ES does not honor the custom cache size settings we have set (of course we checked the _cluster/settings API, and they are exactly as we set them).
  • There is another cache I know nothing about that the clear cache API also clears, but I would like to know what it is, why it's so big and how to prevent it from growing so large.
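To check the first explanation, one option is to compare the reported size of each cache with the heap in use on every node, using the _nodes/stats response. The sketch below does that on the decoded JSON; `cache_memory_by_node` is my own helper and the sample payload is invented for illustration (our real numbers differ).

```python
def cache_memory_by_node(stats):
    """Return per-node cache sizes, heap used, and heap not held by those caches."""
    report = {}
    for node_id, node in stats["nodes"].items():
        indices = node["indices"]
        caches = {
            "fielddata": indices["fielddata"]["memory_size_in_bytes"],
            "query_cache": indices["query_cache"]["memory_size_in_bytes"],
            "request_cache": indices["request_cache"]["memory_size_in_bytes"],
        }
        heap_used = node["jvm"]["mem"]["heap_used_in_bytes"]
        report[node.get("name", node_id)] = {
            "caches": caches,
            "caches_total": sum(caches.values()),
            "heap_used": heap_used,
            "heap_outside_caches": heap_used - sum(caches.values()),
        }
    return report


# Hypothetical payload with illustrative values only.
sample_stats = {
    "nodes": {
        "node-id-1": {
            "name": "master-1",
            "indices": {
                "fielddata": {"memory_size_in_bytes": 900_000_000},
                "query_cache": {"memory_size_in_bytes": 500_000_000},
                "request_cache": {"memory_size_in_bytes": 100_000_000},
            },
            "jvm": {"mem": {"heap_used_in_bytes": 16_000_000_000}},
        }
    }
}

report = cache_memory_by_node(sample_stats)
for name, entry in report.items():
    print(name, entry)
```

If the three reported sizes stay within their limits while the heap is near 99%, the memory freed by the clear would have to come from something other than these caches as reported.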

In any case, it seems related to something introduced since 6.7, because we never had this issue before. We never even had to customize the cache sizes, as the heap was never that high.

Can anybody help us understand, please?
The only workaround we have for now is a cron job calling the clear cache API each day. For obvious reasons, we would prefer to find another way to avoid this issue.
Thanks.