High heap usage, old GC does not run consistently

cross post from use_real_memory setting causes GC collection issues · Issue #62278 · elastic/elasticsearch · GitHub per request.

Elasticsearch version (bin/elasticsearch --version): 7.10.0

Plugins installed:

JVM version (java -version): openjdk version "11.0.5" 2019-10-15 LTS

OS version (uname -a if on a Unix-like system): rhel 7

We're trying to upgrade from 5 to 7 (quite late, yes yes). Our old cluster was 1 node, and our new one is 3. So there really is no reason why this new setup can't handle the old's data. I set up our ingestion pipeline to write to the new cluster, without copying any old data, and after a week or two it hit this issue. I cleared the cluster, and I tried copying over the smaller indices via python's bulk helper, and I could consistently hit this issue. It looks like only one node is hitting the limit, but I don't see the old heap GC run basically ever.

My manual copy test was roughly 5 GB of data before I ran into the issue consistently (I did not manage to copy everything). There are 30 indices, usually set up with 5 primary 1 replica (should probably make that 3 primary or less for this test cluster, but I don't think it'd break things this bad). Each node has 80 to 90 segments. In theory, there should be tons of runway here, there's no reason for a node to be using this much heap.

Our circuit breaker was 50% when I first saw the issue (hold over from our 5 config). I raised it to 75%, then did that manual copy test, and I still see it consistently.

I was not writing to any node in particular, afaik, we've set up an alias that should do round robin over the nodes.

Notice we are using CMS (default jvm.options + java 11 = CMS). This thread has a similar issue, but everyone fixed it by moving to CMS.

I then set use_real_memory to false and turned the circuit breaker back to 50%. It was able to handle this, and old GC does run, but the heap usage is still very high for this 9.2 gb of data I ended up testing with (we have a index that is a few hundred gb I did not test with since copying that would take forever).


I didn't pull stats directly here, but this is the graph from when we set up the cluster to when we noticed the issue. This is probably just handful of GB (less than 10) over these weeks; we didn't backfill our data, this is just changes to our data.

Screen Shot 2021-02-23 at 12.37.30 PM
Heap usage eventually gets to 50%. Bulk API stats throwing circuit breaking exceptions.


While testing with use_real_memory true:

Scrubbed _nodes/jvm _nodes/jvm · GitHub
Scrubbed node stats scrubbed _nodes/stats/jvm · GitHub

Heap usage was at this level for days after the test. Old GC did not run.

Screen Shot 2021-02-23 at 12.35.15 PM
Node was stuck at 50% circuit breaker. Restarted with 75%, then ran my test and hit the breaker with bulk API. 2 nodes go back down, but one node stays hot. Even querying _nodes/stats/jvm can cause the circuit breaker exception.


After setting to false:

Scrubbed _nodes/jvm: _nodes/jvm with use_real_memory false · GitHub
Scrubbed node stats: _nodes/statswith use_real_memory false · GitHub

Heap usage is still very high for this amount of data, afaik, but I can actually see it GC in some graphs. Circuit breaker was never triggered, despite the heap going over the limit of 50%. I assume that the stats here are using a different calculation than the circuit breaker is.

Note that after my experiment, memory usage is pretty constant. Not sure if this is expected.

Screen Shot 2021-02-23 at 12.32.27 PM
You can see I restarted the cluster, heap went to basically 0, then I started the test, after which heap goes up and stays up.


I also tried enabling more data going to our cluster. Our job ran here, and ingested 32 GB of data. The circuit breakers weren't hit and looks like GC is running. So maybe the high-heap-usage-with-no-traffic is expected? It also looks like our jobs didn't have errors with the bulk API here, so we didn't trip circuit breakers (50% with use_real_memory false still).

Screen Shot 2021-02-23 at 12.42.38 PM


Let me know if you need more information, I'd be happy to provide it. I'm also not sure what is expected here, maybe heap usage is meant to be quite high with ES and GC only run infrequently. It at least seems like with 50% circuit breaker and use_real_memory is false, the cluster is usable, even if heap does seem high.

Thanks for opening this here. I've been mulling this one over since I read your post on GitHub. My initial reaction to the issue is that there is a configuration conflict with using the real memory circuit breaker, the CMS settings shipped by default, and the circuit breaker limit.

The issue as I understand it is that the configuration options for the JVM that we ship conflict with the setting of the breaker limit for real memory. In the scenario that you describe, the CMS collector will only start an old GC once the old space is 75% full due to the CMSInitiatingOccupancyFraction and UseCMSInitiatingOccupancyOnly JVM config option. However, the JVM may never be able to run an old GC if the circuit breaker keeps stopping execution. If this is indeed the case, then I agree that we should update the circuit breaker to handle this scenario better as the circuit breaker is hurting the situation more than helping.

Would you be willing to run a test with use_real_memory set to true and the circuit breaker set to 77% or higher? Based on the node stats output, I calculated roughly what 75% of old gen + the max of young and survivor spaces is and I believe that by not circuit breaking until 77% of heap is used then this will allow the CMS old collector to eventually be triggered and reclaim heap space.

If you want to see the heap usage go down, you could consider commenting out the UseCMSInitiatingOccupancyOnly jvm option line and/or modifying the CMSInitiatingOccupancyFraction value. However, I still believe that these are good defaults.

It might be worthwhile resetting the settings indices.breaker.total.limit and use_real_memory to defaults (95% and true). As Jay explained, CMS is likely to let heap grow way beyond 75%, but should clear out before 95%.

Notice that the real memory circuit breaker (default in 7.x) works very differently from the non-real memory circuit breaker. Real memory circuit breaker accounts for all allocated memory (since it uses the JVM stats), whereas the non-real memory circuit breaker only accounts for the places where we instrumented our code to account for it. In turn this means that lowering the setting from the default 95% should normally not be necessary with real memory circuit breaker on, whereas tuning this setting was more frequently necessary on a 5.x cluster.

I've just reset things to their defaults (not passing in config value; so should be 95%/true). Looks like the cluster is working fine. I guess we should learn to respect the defaults :slight_smile: Thanks

Screen Shot 2021-02-25 at 5.20.24 PM
I restarted the cluster to apply the new settings, let it run for a few hours, and then applied my copy-the-old-data-test. GC seems to be running just fine, and max heap usage, according to this so far, has been 72%, though I don't think the is is exact same number as reported by _nodes/stats/jvm, so that 77% figure from before might be more accurate.

Thanks!

Though I guess some improvement could still be warranted here? Can GC be more aware of the circuit breaker try to run before it hits?