One of our Elasticsearch clusters is showing unexpected behavior on the heap of its warm nodes. The cluster consists of 40 hot data nodes (8 cores, 64 GB RAM of which 30.5 GB is heap, and 1.5 TB of local SSD storage) and 30 warm data nodes (8 cores, 32 GB RAM of which 16 GB is heap, and 3 TB of attached HDD storage).
Briefly, it looks like the node accumulates some kind of temporary objects (query cache entries, shard aggregation data, inverted indices, etc.) until they fill up the whole heap, and then suddenly drops some of this temporary data, freeing several GB at once. This is easily seen in the following 7-day query of the heap size of one of the machines:
The temporary object loading policy and the GC policy that create this behavior are not clear to us, and we are sure they can (and should) be optimized.
We'd like to find out what data is filling up the heap, which Elasticsearch parameters control the heap loading policy for this type of data, and what optimizations can be done to get more consistent behavior on each of the nodes.
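To make the first question concrete, here is a rough sketch of how a per-node heap breakdown could be pulled from the nodes stats API (`GET /_nodes/stats/jvm,indices`). The base URL and the function names `fetch_node_stats` and `summarize_heap` are placeholders for illustration, not something we run today. It also assumes the cluster is reachable without authentication, and that the `indices.segments.memory_in_bytes` field is populated (newer Elasticsearch versions may report it as 0).

```python
import json
import urllib.request


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch JVM and index-level stats for every node (base_url is an assumption)."""
    url = base_url + "/_nodes/stats/jvm,indices"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def summarize_heap(stats):
    """Reduce a nodes-stats response to the heap-related numbers per node name."""
    rows = {}
    for node in stats.get("nodes", {}).values():
        jvm = node.get("jvm", {})
        mem = jvm.get("mem", {})
        idx = node.get("indices", {})
        old_gc = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        rows[node.get("name", "?")] = {
            "heap_used": mem.get("heap_used_in_bytes", 0),
            "heap_max": mem.get("heap_max_in_bytes", 0),
            # Known on-heap consumers that Elasticsearch reports explicitly.
            "segments": idx.get("segments", {}).get("memory_in_bytes", 0),
            "fielddata": idx.get("fielddata", {}).get("memory_size_in_bytes", 0),
            "query_cache": idx.get("query_cache", {}).get("memory_size_in_bytes", 0),
            "request_cache": idx.get("request_cache", {}).get("memory_size_in_bytes", 0),
            # Old-generation collections, to correlate with the sudden heap drops.
            "old_gc_count": old_gc.get("collection_count", 0),
        }
    return rows


if __name__ == "__main__":
    for name, row in sorted(summarize_heap(fetch_node_stats()).items()):
        print(name, row)
```

Sampling this every few minutes on a warm node and comparing the explicit consumers against `heap_used` should show whether the growth is in caches and segment memory, or in short-lived garbage that is only reclaimed when an old-generation collection runs.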
For reference, here is a previous discussion about our warm storage's heap:
Any help is appreciated.