Maybe you have a cluster that is affected by memory issues. Maybe you read some of the Discuss posts about clusters running out of memory on Elastic Cloud, especially 1GB clusters using Elasticsearch 5.x.
We know that there is an issue and we’re working on improving the user experience for the clusters that are affected.
The issue is more prevalent on 5.x clusters, but some OOM issues have also affected 2.x clusters. Here’s what you might have seen:
- Repeated OOMs and cluster restarts, with frequent emails telling you that there was a "Node Restart Due to Running out of Memory," indicating that nodes are continuously restarting due to memory pressure.
- Cluster nodes reporting high memory pressure even when they are not handling a lot of work. Memory pressure is higher than expected when a node starts and gets worse from there.
- Once OOMs begin, they tend to repeat in a time-based pattern until some kind of action is taken, often with help from us. Sometimes, tiebreaker nodes OOM, too.
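If you want to check for memory pressure yourself, heap usage is reported per node by the node stats API (`GET _nodes/stats/jvm`). Here is a minimal sketch in Python that flags hot nodes; the payload below is a made-up sample for illustration, but the field names match the shape of the real API response, and 75% is roughly where Elasticsearch's old-generation garbage collection starts kicking in.

```python
# Flag nodes under memory pressure from a node-stats-shaped response.
# The payload here is a made-up sample; in practice you would fetch it
# from your cluster with: GET <cluster-endpoint>/_nodes/stats/jvm
sample_response = {
    "nodes": {
        "abc123": {"name": "instance-0000000000",
                   "jvm": {"mem": {"heap_used_percent": 82}}},
        "def456": {"name": "instance-0000000001",
                   "jvm": {"mem": {"heap_used_percent": 41}}},
    }
}

# Sustained readings above ~75% heap usage suggest memory pressure.
HEAP_WARN_PERCENT = 75

def nodes_under_pressure(stats):
    """Return (name, heap_used_percent) for nodes over the threshold."""
    hot = []
    for node in stats["nodes"].values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used > HEAP_WARN_PERCENT:
            hot.append((node["name"], used))
    return hot

print(nodes_under_pressure(sample_response))
# [('instance-0000000000', 82)]
```

A node that sits above this threshold even at low traffic, and climbs from there, matches the symptom pattern described above.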
What is the Elastic Cloud team doing to fix this?
There's a lot of work going on in the background investigating why these memory issues are cropping up, with fixes rolling out incrementally as we track down culprits and test fixes. We’re working directly with the Elasticsearch development team to improve memory handling. We’re also testing different Java tuning options on Elastic Cloud, and we’ve already bumped the memory to Java heap ratio for tiebreakers. In short, we’ve got our best rocket scientists in white lab coats working on the problem.
Getting these changes just right for the many different clusters hosted on Elastic Cloud can take a bit of time, and we appreciate your patience. Our goal is to improve the user experience right down to the smallest 1GB cluster. As part of our root cause analysis, we’ve already identified some memory leaks in Elasticsearch and Lucene, which are now being fixed.
Is there anything I can do to help?
Yes, there is! Here are a couple of suggestions that can help your clusters:
- If you haven’t already done so, try upgrading to one of the latest versions, such as 5.1.2. Newer versions include improvements that reduce the memory footprint. For example, Elasticsearch 5.1.2 changed a Netty setting that improved memory usage over earlier 5.x versions.
- Size your clusters appropriately. If you are running a smaller cluster that is continually maxed out, high memory pressure is likely hurting cluster performance and can cause node restarts. What’s worse, once your cluster is overwhelmed by memory pressure, any kind of resize operation will take much longer. So resize early and match your cluster to the workload it’s handling. (If you haven’t seen it yet, we also have some new guidance on Keeping Your Cluster Healthy.)
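To make "resize early" concrete, here is a hypothetical heuristic, not an official Elastic Cloud rule: if heap usage stays high across most recent samples rather than just spiking occasionally, the cluster is likely undersized for its workload and a resize will only get slower the longer you wait.

```python
# Hypothetical heuristic (not an official Elastic Cloud rule): suggest a
# resize when heap usage has stayed high across most recent samples,
# since a cluster already overwhelmed by memory pressure resizes slowly.

def should_resize(heap_samples, threshold=85, sustained_fraction=0.8):
    """Return True if most recent heap_used_percent samples exceed the threshold."""
    if not heap_samples:
        return False
    high = sum(1 for p in heap_samples if p >= threshold)
    return high / len(heap_samples) >= sustained_fraction

# A cluster that is continually maxed out:
print(should_resize([88, 91, 86, 90, 87]))   # True
# A cluster with only an occasional spike:
print(should_resize([60, 92, 55, 48, 70]))   # False
```

The exact threshold and window are yours to tune; the point is to distinguish sustained pressure, which calls for a resize, from transient spikes, which usually don't.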
Thank you for reading and for your patience! Until we get this all sorted out, let us know if you’re having issues in this forum, or work with our Support team if you have a Gold or Platinum subscription. We’re here to help.
The Elastic Cloud team