Hi,
I’m running Elasticsearch 9.0.3 managed by ECK on AKS, and I’m seeing persistent high JVM heap usage on my warm nodes.
Cluster topology
- 3 × master nodes
- 2 × hot data nodes
- 2 × warm data nodes
- Persistent volumes per data node
- JVM heap on warm nodes: 2.5 GB
Workload
- APM data streams:
  - traces-apm-*
  - logs-apm.*
  - metrics-apm.*
- Traces volume: 20+ GB per day (see the size check below)
- Logs/metrics: typically tens of MB per day
- Rollover currently happens daily (or at ~50 GB)
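For context, the volume figures above come from the backing index sizes, checked roughly like this with the cat API (same index patterns as listed):

```
GET _cat/indices/traces-apm-*,logs-apm.*,metrics-apm.*?v&h=index,creation.date.string,pri.store.size,docs.count&s=creation.date:desc
```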
ILM (current)
- hot → warm after ~8 days
- delete after 180 days
- replicas = 0 (temporarily set on warm to reduce pressure)
Problem
- One warm node currently holds ~1 TB of data and ~950 shards
- JVM heap usage on that node stays around 92–93%
- Heap pressure appears to be driven by shard/segment overhead rather than fielddata (based on the stats calls below)
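For what it's worth, this is roughly how I'm reading the heap picture; the warm node name below is a placeholder for my actual node name:

```
# Heap usage and roles per node
GET _cat/nodes?v&h=name,node.role,heap.percent,heap.current,heap.max

# Shard count and data size per node
GET _cat/allocation?v

# Segment memory vs. fielddata on the warm node
GET _nodes/warm-node-0/stats/indices/segments,fielddata
```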
Question
I’m considering splitting ILM into two policies (rough Console sketch below):

- Traces policy
  - rollover: 50 GB OR 1 day
  - hot → warm after ~8 days
  - delete after 180 days
- Logs/metrics policy
  - rollover: 5 GB OR 10–30 days
  - hot → warm after ~8 days
  - delete after 180 days
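Concretely, the two policies would look roughly like this. The policy names are placeholders that I'd attach to the data streams via their index templates, I've used max_primary_shard_size as the size condition, I've picked 30 days for the logs/metrics max_age just as an example from the 10–30 day range, and the allocate action simply mirrors the temporary replicas = 0 setting mentioned above:

```
# Hypothetical traces policy: large/daily rollover, same warm/delete timings as today
PUT _ilm/policy/traces-apm-custom
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "8d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}

# Hypothetical logs/metrics policy: smaller size threshold, much longer max_age
PUT _ilm/policy/logs-metrics-apm-custom
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "5gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "8d",
        "actions": {
          "allocate": { "number_of_replicas": 0 }
        }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The intent is that the small logs/metrics streams stop producing a near-empty backing index (and shard) every day, while traces keep roughly the current behaviour.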
Is this a recommended approach for APM-heavy clusters?
Thanks in advance for any guidance or real-world experience.