ILM policy for indices from APM server

Hi,

I’m running Elasticsearch 9.0.3 managed by ECK on AKS, and I’m seeing persistently high JVM heap usage on my warm nodes.

Cluster topology

  • 3 × master nodes

  • 2 × hot data nodes

  • 2 × warm data nodes

  • Persistent volumes per data node

  • JVM heap on warm nodes: 2.5 GB

Workload

  • APM data streams:

    • traces-apm-*

    • logs-apm.*

    • metrics-apm.*

  • Traces volume: 20+ GB per day

  • Logs/metrics: typically tens of MB per day

  • Rollover currently happens daily (or at ~50 GB, whichever comes first)

ILM (current)

  • hot → warm after ~8 days

  • delete after 180 days

  • replicas = 0 (temporarily set on warm to reduce pressure)

Problem

  • One warm node currently holds ~1 TB of data and ~950 shards

  • JVM heap usage on that node stays around 92–93%

  • Heap pressure appears to be driven by shard/segment overhead rather than fielddata
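
For reference, these numbers can be confirmed with the cat APIs, e.g.:

    # JVM heap per node
    GET _cat/nodes?v&h=name,node.role,heap.percent,heap.max

    # Shard count and disk used per node
    GET _cat/allocation?v

    # Which shards sit on the warm nodes
    GET _cat/shards?v&h=index,shard,prirep,store,node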

Question
I’m considering splitting ILM into two policies (sketched below):

  1. Traces policy

    • rollover: 50GB OR 1d

    • hot → warm after ~8 days

    • delete after 180 days

  2. Logs/metrics policy

    • rollover: 5GB OR 10–30d

    • hot → warm after ~8 days

    • delete after 180 days
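
In policy form, the split would look roughly like this (policy names are placeholders, and I’ve assumed max_primary_shard_size for the size condition, with the 30d upper end of the logs/metrics range):

    # Traces: large volume, frequent rollover
    PUT _ilm/policy/traces-apm-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
            }
          },
          "warm": {
            "min_age": "8d",
            "actions": { "allocate": { "number_of_replicas": 0 } }
          },
          "delete": { "min_age": "180d", "actions": { "delete": {} } }
        }
      }
    }

    # Logs/metrics: small volume, infrequent rollover
    PUT _ilm/policy/logs-metrics-apm-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_primary_shard_size": "5gb", "max_age": "30d" }
            }
          },
          "warm": {
            "min_age": "8d",
            "actions": { "allocate": { "number_of_replicas": 0 } }
          },
          "delete": { "min_age": "180d", "actions": { "delete": {} } }
        }
      }
    }

Both policies would still need to be referenced from the relevant index templates (index.lifecycle.name) so that new backing indices pick them up after rollover.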

Is this a recommended approach for APM-heavy clusters?

Thanks in advance for any guidance or real-world experience.

Yes, splitting ILM policies by data type (traces vs logs/metrics) is a recommended and proven approach for APM-heavy clusters.

Your main issue is shard and segment overhead on the warm nodes, not data volume itself. Traces generate far more shards and segments than logs/metrics, and with a single shared policy the low-volume streams still roll over daily, adding many tiny shards for very little data. Isolating the two with dedicated ILM policies is the correct move.

Key additional recommendations:

  • Reduce shard count aggressively for traces (fewer, larger shards).

  • Consider a shorter hot phase for traces.

  • Apply forcemerge (max 1 segment) before or during the warm transition; see the sketch after this list.

  • Avoid long retention on the warm tier for traces unless it’s required.

  • If possible, increase heap on the warm nodes or add one more warm node.
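
For the shrink and forcemerge points, here is a minimal sketch reusing the traces policy name from your post (values are illustrative, and shrink only matters if the backing indices have more than one primary shard):

    # Sketch: shrink + forcemerge on the warm transition
    PUT _ilm/policy/traces-apm-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }
            }
          },
          "warm": {
            "min_age": "8d",
            "actions": {
              "shrink": { "number_of_shards": 1 },
              "forcemerge": { "max_num_segments": 1 },
              "allocate": { "number_of_replicas": 0 }
            }
          },
          "delete": { "min_age": "180d", "actions": { "delete": {} } }
        }
      }
    }

ILM applies warm-phase actions in a fixed order (shrink runs before force merge), so the JSON order doesn’t matter; once the merge completes, GET _cat/segments/<index>?v should report a single segment per shard.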

Overall: Yes, your approach is correct, but shard reduction and segment consolidation are the real fixes.


@Rafa_Silva Thanks a lot for the response!
This is very helpful and confirms what I was suspecting.
