Hi Eli,
The garbage collector is triggered when heap usage reaches 75% (-XX:InitiatingHeapOccupancyPercent=75). However, the garbage collector runs concurrently with the application, so it can (and will) happen that the application allocates additional heap memory while the heap is being cleaned up. The default garbage collector for Elasticsearch, concurrent mark-sweep (CMS), also runs concurrently with the application, but the two collectors have different runtime characteristics. If the concurrent phase takes longer for G1, the higher heap usage is reported until the end of its concurrent cycle, so the chance is higher that the real memory circuit breaker trips. Once the GC cycle finishes, heap usage as seen by the real memory circuit breaker should drop. So you can do a couple of things:
- Switch to the CMS garbage collector.
- Stay with G1 but trigger the concurrent cycle sooner, i.e. reduce -XX:InitiatingHeapOccupancyPercent from 75 to a lower value, e.g. 65. Note that this means the garbage collector will run sooner, so you will see more GC events, which might negatively impact performance. It's best to test this change thoroughly with a representative benchmark before doing this in production.
- Increase the threshold at which the real memory circuit breaker trips by changing indices.breaker.total.limit from 95% to a higher value. In our experiments, though, we've seen that if we set the threshold too high, the circuit breaker becomes ineffective because it triggers too late.
- Turn off the real memory circuit breaker by setting indices.breaker.total.use_real_memory to false. I'd advise against it because this increases your risk of running into out-of-memory errors by overloading the node. The real memory circuit breaker enables the node to actually push back without falling over if too much traffic is hitting the node.
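To get a feel for the trade-off between the G1 threshold and the breaker limit, here is a rough back-of-the-envelope sketch in Python. It is a deliberately simplified model, not how the JVM or Elasticsearch compute anything internally: it assumes both thresholds are percentages of the total heap, that the application allocates at a constant rate, and that nothing is reclaimed until the concurrent cycle ends. The function names, the 8 GiB heap, and the allocation rate are made up for illustration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def headroom_bytes(heap_bytes, ihop_percent, breaker_limit_percent):
    """Heap the application can still allocate between the start of the
    concurrent cycle (ihop_percent) and the point where the real memory
    circuit breaker would trip (breaker_limit_percent)."""
    return heap_bytes * (breaker_limit_percent - ihop_percent) // 100


def seconds_until_trip(heap_bytes, ihop_percent, breaker_limit_percent,
                       alloc_bytes_per_sec):
    """How long the concurrent cycle may take before the breaker trips,
    assuming a constant allocation rate and no memory freed meanwhile."""
    headroom = headroom_bytes(heap_bytes, ihop_percent, breaker_limit_percent)
    return headroom / alloc_bytes_per_sec


if __name__ == "__main__":
    heap = 8 * GIB          # hypothetical heap size
    alloc_rate = 500 * MIB  # hypothetical allocation rate per second
    for ihop in (75, 65):
        budget = seconds_until_trip(heap, ihop, 95, alloc_rate)
        print(f"IHOP={ihop}%: concurrent cycle has ~{budget:.1f}s before the breaker trips")
```

In this model, lowering the threshold from 75 to 65 widens the gap to the 95% breaker limit from 20 to 30 percentage points of the heap, so the concurrent cycle has about 50% more time to finish before the breaker would trip. Real behavior depends on your workload, so treat the numbers only as a way to reason about the settings.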
Daniel