Long-term, low-query log storage in an Elasticsearch 5.6 cluster - what are the risks of allocating more than 50% of RAM to heap?

Hello! We are trying to optimize our ES 5.6 cluster to use less heap memory.
We have 19 TB of data (9 billion rows) in ~560 shards on 2 warm and 4 cold data nodes, which are only rarely queried. Memory is a scarce resource in our private cloud, so the following measures were taken:

-- mappings optimized
-- on cold nodes, indices are force merged and then shrunk to 1 shard per index, resulting in 60-90 GB per shard
-- indexing of new data takes place only on "hot" nodes
-- cold node machines have 32 GB of RAM, and recently (as we hit GC issues again due to constantly high heap usage) the nodes were reconfigured to use 20 GB of heap.

What are the risks of giving 60% of RAM to the Java heap? Can we go further than that? The cluster receives ~10 queries daily, and users can wait some extra time.

Perhaps there are more options? Unfortunately, upgrading to 6.8 and using frozen indices is not an option at the moment.

I believe you should shrink indices before force merging them in order to get the full benefits, not the other way around. You can also try freezing older indices, as that will further reduce heap usage, but make sure they are force merged as the final step just before freezing.
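The suggested order (shrink first, force merge last) can be sketched as a sequence of REST calls. This is only an illustration: it builds the requests without sending them, and the index names, the node name, and the helper function are hypothetical. The endpoints are the standard index settings, `_shrink`, and `_forcemerge` APIs; check the settings against the 5.6 docs before using them.

```python
import json


def shrink_then_forcemerge_plan(source, target, node_name):
    """Return the ordered REST calls as (method, path, body) tuples.

    Nothing is sent to the cluster; the caller decides how to execute them.
    """
    return [
        # 1. Put a copy of every shard on one node and block writes,
        #    both of which the shrink API requires.
        ("PUT", f"/{source}/_settings", {
            "index.routing.allocation.require._name": node_name,
            "index.blocks.write": True,
        }),
        # 2. Shrink into a new index with a single primary shard.
        ("POST", f"/{source}/_shrink/{target}", {
            "settings": {"index.number_of_shards": 1},
        }),
        # 3. Force merge the shrunken index as the final step.
        ("POST", f"/{target}/_forcemerge?max_num_segments=1", None),
    ]


# Hypothetical names, for illustration only.
for method, path, body in shrink_then_forcemerge_plan(
    "logs-2018.01", "logs-2018.01-shrunk", "cold-node-1"
):
    print(method, path, json.dumps(body) if body is not None else "")
```

In practice you would wait for shard relocation to finish after step 1 and for the shrunken index to go green after step 2 before moving on.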

The main risk here IMO is that the process tries to use up to 2x60% = 120% of the available RAM and gets killed by the OS as a consequence. The 5.6 docs don't mention this explicitly, I think, but the current docs on setting the heap size note that you should expect Elasticsearch to consume more memory than the configured heap size. The actual limit is approximately twice the heap size, hence the recommendation to keep the heap size below 50% of the available RAM.
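To make the arithmetic concrete, here is a rough sketch of that rule of thumb applied to the 32 GB cold nodes. The factor of 2 is the approximation from the paragraph above, not a hard limit, and the function names are made up for this example.

```python
def worst_case_footprint_gb(heap_gb, factor=2.0):
    """Approximate upper bound on process memory: about twice the heap."""
    return heap_gb * factor


def headroom_gb(ram_gb, heap_gb, factor=2.0):
    """RAM left over if the process reaches its worst-case footprint.

    A negative value means the process could outgrow physical RAM.
    """
    return ram_gb - worst_case_footprint_gb(heap_gb, factor)


# 32 GB machine, as in the question.
print(headroom_gb(32, 16))  # 50% heap   -> 0.0
print(headroom_gb(32, 20))  # 62.5% heap -> -8.0
```

With a 20 GB heap the worst case exceeds RAM by about 8 GB, which is where the risk of the OS killing the process comes from.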

You could also try closing indices and only opening them when needed; this is effectively what frozen indices do for you under the hood.

Note that 5.6 is almost a year past the end of its life, and upgrading is strongly recommended at this point. As well as adding frozen indices, we've done a lot of work since 5.6 on streamlining memory usage (e.g. https://issues.apache.org/jira/browse/LUCENE-8635) and on working better in memory-constrained environments (e.g. https://github.com/elastic/elasticsearch/pull/31767), which sounds like it would benefit your cluster.

Thanks for your replies! We'll change the order of the shrink/force merge operations and will closely monitor the ES process size.
Yes, frozen indices are exactly the functionality we are missing, but our logging setup is quite complex, with many parts, and we can't risk losing logs during an in-place upgrade...

We tried the close/open approach, but the cluster goes yellow for tens of minutes if we open 10 indices simultaneously (some of those rare queries span 8+ indices). Perhaps that was due to the already saturated heap...

Again, thank you very much!

Are you sure a synced flush is performed on these indices before closing them? That should help them recover to green more quickly. Again, needless to say, there have been big improvements in this area in versions after 5.6.
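As a rough sketch of that close/open workflow: synced flush each index right before closing it, and when reopening, wait for each index to go green before opening the next. The one-at-a-time reopening and the 30m timeout are assumptions made for this example, not something tested on your cluster, and the helper names and index names are hypothetical. The code only builds the request list and sends nothing.

```python
def close_plan(indices):
    """Synced flush, then close, for each index in turn."""
    plan = []
    for name in indices:
        # A synced flush writes a sync marker to all shard copies,
        # which lets replicas recover faster when the index is reopened.
        plan.append(("POST", f"/{name}/_flush/synced"))
        plan.append(("POST", f"/{name}/_close"))
    return plan


def open_plan(indices, timeout="30m"):
    """Open indices one at a time, waiting for green after each open."""
    plan = []
    for name in indices:
        plan.append(("POST", f"/{name}/_open"))
        plan.append(
            ("GET", f"/_cluster/health/{name}?wait_for_status=green&timeout={timeout}")
        )
    return plan


# Hypothetical index names, for illustration only.
for method, path in close_plan(["logs-2018.01", "logs-2018.02"]):
    print(method, path)
for method, path in open_plan(["logs-2018.01", "logs-2018.02"]):
    print(method, path)
```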
