As covered in the Memory Issues We'll Remember blog post, we’ve been spending plenty of time trying to optimize the memory footprint of clusters running Elasticsearch.
Collaborating with the Elasticsearch developers, we identified one finding we could act on: reducing the JVM code cache size and disabling tiered compilation. The code cache can consume a large amount of memory, which we wanted to free up for Elasticsearch to use. Our testing found that even after the reduction, Elasticsearch typically stayed below half of the new code cache size, and performance was marginally better.
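For context, a change of this shape is expressed with two standard HotSpot flags. The sketch below is purely illustrative: the size shown is a placeholder rather than the value we deployed, and on our hosted clusters these settings are managed for you rather than set by hand.

    # Illustrative only: shrink the reserved code cache and disable tiered
    # compilation. The 128m size is a placeholder, not the value we deployed.
    ES_JAVA_OPTS="-XX:ReservedCodeCacheSize=128m -XX:-TieredCompilation" ./bin/elasticsearch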
As a result, we deployed this change, and any new, modified, or restarted cluster began to pick up the new flags and enjoy a smaller code cache footprint.
This ran well for a while, until we saw a few clusters, seemingly at random, start using a lot of CPU and keep hogging it.
In the first clusters we investigated, the symptom correlated with large increases in load or with buggy scripts hogging the CPU, and since overloaded clusters are fairly common, this delayed a thorough investigation.
Investigating the issue
During our investigation into the excessive CPU use, our engineers found what appears to be a bug in the JVM, where it falls back into interpreted mode even though there is plenty of code cache left. This causes the JVM to spend most of its CPU interpreting bytecode rather than executing compiled code. It also made hot threads output misleading, since whatever the node happened to be doing at the time appeared to consume a lot of CPU.
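If you want to check a node yourself, the JDK tooling can report code cache usage directly, which sidesteps the misleading hot threads output. This is a rough sketch, assuming shell access to the node and a JDK whose jcmd exposes the Compiler diagnostic commands:

    # List running JVMs to find the Elasticsearch process id.
    jcmd | grep -i elasticsearch
    # Print code cache usage; on an affected node this still reports plenty of
    # free space even though the JVM is burning CPU in the interpreter.
    jcmd <pid> Compiler.codecache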
Overall, we found that about 1% of clusters were affected, most of them running normally for days or even weeks before the issue manifested. Some clusters, however, ran into it quickly, which made this critical to address.
Findings and future-proofing
We have backed out the code cache-related changes and cycled the nodes of clusters with the reported issue.
We’re trying to reproduce this issue synthetically so we can study the potential JVM bug. We are also changing our burn-in procedures for changes of this nature, to catch problems like this earlier and to protect against potential future JVM bugs. We will follow up with any relevant information that comes out of our efforts to profile code cache behaviour.
Any new, modified, or restarted cluster will also pick up this change, and we’ll be going through the fleet to spot nodes burning too much CPU and restart them, along with any nodes susceptible to the problem. If your cluster has this issue, you can restart it yourself to get it back to normal; this typically takes about a minute.
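As a quick way to spot a node that is burning too much CPU, the cat nodes API can list per-node CPU and load. Treat this as a sketch: the exact column names vary a little between Elasticsearch versions.

    # Show per-node CPU and one-minute load (column names may differ by version).
    curl -s 'localhost:9200/_cat/nodes?v&h=name,cpu,load_1m'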
We deeply appreciate the trust you put in us as users of our service, and we sincerely apologize for the disruption this issue caused those of you affected. We are always learning and improving, and your feedback and patience are a big part of that.