Expected Heap Memory Requirements for node.data==false, node.master==false

Good Day!

I have a cluster with 3 billion records, 2 TB of data, 24 shards, and 1 replica. It is split into two zones: Zone 1 has a single machine, the master node, holding all 24 shards. Zone 2 has multiple nodes (usually 10) that share the replicas. This configuration was chosen to minimize AWS costs, with Zone 2 running on spot instances. The cluster is optimized for low cost (it is part of an experiment), and I am working on reducing the number of OoM errors. Search speed is not a priority at this time because our customers are internal and know what to expect in terms of latency.

The master node, being the only permanent node, gets all search requests and has the highest heap requirements (average peak of 25 GB). The other nodes only require 5 GB or 10 GB, depending on the number of shards they hold. I would like to reduce the heap requirements on the master node. I have already converted all properties to doc_values==true, which gave excellent heap reductions, but I want to go lower.

I am considering adding a permanent node to Zone 2 (node.data==false, node.master==false) that would accept all search requests. I want to do this because I believe the current configuration is not fully utilizing the shards in Zone 2 and instead prefers the shards on the master in Zone 1. I am apprehensive about setting up this node because On-Demand prices are much higher (approximately 5x) than Spot prices.

If I set up a non-data node, how should I set the heap? Do I continue to set heap to 50% of total memory, or can I crank it up to 90% (because there is no data, there are no Lucene indexes, and there is minimal need for disk cache)?

Comments? Suggestions?

Thank you!

You can increase heap to 75% (or more if you want to live on the edge) for client nodes; you don't have to worry about FS caching on these, after all.
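As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names (heap_size_gb, heap_flags) and the optional cap_gb parameter are made up for this example and are not part of Elasticsearch; the 50% and 75% fractions come from this thread, and -Xms/-Xmx are the standard JVM heap flags. Treat it as a back-of-the-envelope helper, not a tuning recommendation.

```python
def heap_size_gb(total_ram_gb, is_data_node, client_fraction=0.75, cap_gb=None):
    """Suggest a JVM heap size in GB for a node.

    Data nodes keep the usual 50% split so the OS can use the rest
    for filesystem caching. Client (non-data, non-master) nodes can
    go higher; 0.75 is the figure suggested in this thread.
    cap_gb is an optional upper bound (hypothetical parameter).
    """
    fraction = 0.5 if is_data_node else client_fraction
    heap = total_ram_gb * fraction
    if cap_gb is not None:
        heap = min(heap, cap_gb)
    return heap


def heap_flags(heap_gb):
    """Render matching min/max heap flags for the JVM, in megabytes."""
    mb = int(heap_gb * 1024)
    return f"-Xms{mb}m -Xmx{mb}m"


if __name__ == "__main__":
    # Example: a 16 GB machine used as a data node vs. a client node.
    print(heap_flags(heap_size_gb(16, is_data_node=True)))
    print(heap_flags(heap_size_gb(16, is_data_node=False)))
```

Setting the minimum and maximum heap to the same value avoids heap resizing at runtime; pass a different client_fraction if you want to try something closer to the 90% mentioned above.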

Does the 30.5GB rule hold true here, as well?


"The master node, being the only permanent node, gets all search requests..."

Isn't this risky? As search load increases, the master node might run into memory issues and fail to react to cluster membership events.

It sure is.