Hello,
We have just migrated our cluster from 6.8 to 7.12 and have noticed some recurring errors during intensive parallel indexing.
Here is the kind of error we have:
```
[parent] Data too large, data for [<http_request>] would be [16400370248/15.2gb], which is larger than the limit of [16320875724/15.1gb], real usage: [16400370248/15.2gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=8667837/8.2mb, in_flight_requests=0/0b, model_inference=0/0b, accounting=474802236/452.8mb]
```
From what I have seen on this forum for this error, the fact that it did not appear before is caused by ES using real memory to calculate the parent circuit breaker, and/or by the JVM shipped with ES using G1GC instead of CMS (we are using the bundled JVM, and our jvm.options are the defaults, except for the heap size, which is set to half of our servers' memory).
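For reference, the only change from the default jvm.options is the heap size; the value below is just an example, as the actual value is half of each server's RAM:

```
# jvm.options – only the heap size differs from the defaults
# (16g is illustrative; we set it to half of the server's RAM)
-Xms16g
-Xmx16g
```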
In some topics on this forum, I have seen that switching from G1GC back to CMS solved the issue for some users, but this doesn't seem possible with 7.12, as the JVM shipped with it no longer supports CMS. We could run ES on a different JVM than the bundled one, but we would prefer to avoid that.
Another solution I have seen, but which I believe is advised against by the Elastic team, is disabling indices.breaker.total.use_real_memory. This would be our fallback workaround, but it seems to be a really useful safeguard against out-of-memory issues, so if possible we would rather not touch it.
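Just to be explicit about what that workaround would be, it is a single static setting in elasticsearch.yml (shown here only for completeness, since this is what we would rather avoid):

```
# elasticsearch.yml – workaround we would prefer not to use
indices.breaker.total.use_real_memory: false
```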
Instead, I'm trying to understand what could cause memory usage to climb that high during our intensive process, but so far I haven't reached any definitive conclusion.
I don't know if it's expected, but calling _nodes/stats/breaker at any time shows that every breaker is very low on usage, except for the parent breaker which, every time I checked, is above 10GB out of the 15.1GB total. I know that with the use_real_memory option it's not supposed to be the sum of the child breakers, but is there a way to know what is taking so much memory? I don't understand why it's so full.
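This is how we check it; the response below is trimmed to the parent breaker only, and the numbers are representative of what we typically see rather than an exact capture:

```
GET _nodes/stats/breaker

# trimmed response (representative values):
"breakers": {
  "parent": {
    "limit_size_in_bytes": 16320875724,
    "limit_size": "15.1gb",
    "estimated_size_in_bytes": 11500000000,
    "estimated_size": "10.7gb",
    "overhead": 1.0,
    "tripped": 0
  }
}
```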
For context, if needed: most of the time when we saw this error, it happened while we were monitoring a few big _update_by_query tasks launched with a Painless script (which accesses nested elements and adds a nested element within the doc, if that matters). Those tasks are launched with a query on at most 5000 parent ids (has_parent => ids), and the script takes a parameter with the same number of values (for each parent id, an integer value that we use in the script).
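To make that concrete, here is a simplified sketch of what those requests look like. The index, join field, nested field, and parameter names are all changed/illustrative, only two parent ids are shown, and the real requests carry up to 5000 ids and values:

```
POST my_index/_update_by_query
{
  "query": {
    "has_parent": {
      "parent_type": "parent_doc",
      "query": {
        "ids": { "values": ["parent-1", "parent-2"] }
      }
    }
  },
  "script": {
    "lang": "painless",
    "params": {
      "value_per_parent": { "parent-1": 10, "parent-2": 20 }
    },
    "source": "String parentId = ctx._source.my_join_field.parent; ctx._source.my_nested_field.add(['computed_value': params.value_per_parent[parentId]])"
  }
}
```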
Thanks for any help provided.