We have a heterogeneous cluster setup where a large set of powerful
hot data nodes ingest live data daily and store about 5 days' worth of it (~15 TB or so). After 5 days we migrate the data to a smaller set of less powerful but larger-in-storage
warm nodes. These warm nodes retain the data for 25 days before we delete it.
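(For context on the mechanics, and purely illustrative: the hot-to-warm move is the usual shard-allocation-filtering pattern. The `box_type` attribute name, the index name, and the endpoint below are assumptions for the sketch, not our actual config.)

```python
import requests

BASE = "http://localhost:9200"  # assumed address of any node in the cluster

# Illustrative sketch: relocate a day's index to the warm tier by requiring
# a node attribute (assumes nodes carry a custom box_type attribute set to
# "hot" or "warm"; attribute and index names here are hypothetical).
index = "events-2017.01.01"
resp = requests.put(
    f"{BASE}/{index}/_settings",
    json={"index.routing.allocation.require.box_type": "warm"},
)
resp.raise_for_status()
```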
The warm nodes are
d2.2xl instances (60 GB memory, 12 TB local storage) and we dedicate half the memory to ES, so a 30 GB heap. Each node stores roughly 5 TB of data, and we have 14 nodes right now. Across those 14 nodes we hold 25 days of indices, times 2 because we store two types of data each day, so 50 indices. This nets out to about
700 shards (roughly 50 per node) and about
70 TB of raw data across the machines. From a performance perspective this is fine: the data is generally not searched frequently, so we don't mind if queries are slow.
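If the raw per-node numbers are useful, this is roughly how we pull them; a minimal sketch using the cat APIs (the localhost address is an assumption):

```python
import requests

BASE = "http://localhost:9200"  # assumed address of any node in the cluster

# Shard count and disk used per node.
print(requests.get(f"{BASE}/_cat/allocation?v").text)

# Heap usage plus segment count and on-heap segment memory per node.
cols = "name,heap.percent,segments.count,segments.memory"
print(requests.get(f"{BASE}/_cat/nodes?v&h={cols}").text)
```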
The problem we've got is that the JVM heap on each of these machines is sitting right at about 80% used, and there is no garbage-collection sawtooth pattern; the machines just hold steady at 80%. After digging, it looks like we're holding an average of
22 GB per node in heap just for terms.
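In case it helps anyone reproduce the observation, here's a minimal sketch of pulling per-node terms memory from the segment stats (assuming a version whose node stats still report terms_memory_in_bytes, and a localhost endpoint):

```python
import requests

BASE = "http://localhost:9200"  # assumed address of any node in the cluster

resp = requests.get(f"{BASE}/_nodes/stats/indices/segments")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    seg = node["indices"]["segments"]
    terms_gb = seg["terms_memory_in_bytes"] / 1024 ** 3
    total_gb = seg["memory_in_bytes"] / 1024 ** 3
    print(f"{node['name']}: segments={seg['count']}, "
          f"terms={terms_gb:.1f} GB, segment_memory={total_gb:.1f} GB")
```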
So, the ultimate question: is it possible for me to tune ES to keep fewer terms in the JVM heap? Or is this a hard limitation, i.e. the more data we hold, the more heap we simply have to have available? Can we not make some kind of performance tradeoff here?
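To give one concrete example of the kind of tradeoff I mean: would force-merging the older, read-only warm indices down to a single segment meaningfully shrink the terms footprint, given that the terms index is held per segment? A sketch of what we'd try, assuming a version with the _forcemerge API (index name and endpoint are hypothetical):

```python
import requests

BASE = "http://localhost:9200"  # assumed address of any node in the cluster

# Hypothetical example: merge an older, no-longer-written index down to one
# segment, on the theory that fewer segments means fewer per-segment terms
# structures on heap. We have not measured how much this actually saves.
index = "events-2017.01.01"
resp = requests.post(f"{BASE}/{index}/_forcemerge",
                     params={"max_num_segments": 1})
resp.raise_for_status()
print(resp.json())
```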