I’m running an EFK stack in an RKE2 cluster on RHEL 8.5. The VMs are hosted in vSphere and use centralized NetApp storage.
Our Elasticsearch footprint currently looks like this:
2 master nodes
2 client nodes
2 hot data pods
We monitor pod throughput in Grafana, and we've noticed that after the hot pods have been running for a while, the network eventually becomes saturated with what appears to be traffic to the NetApp storage.
What's odd is that this doesn't happen immediately: sometimes it starts after about a day, other times it takes a week. But once it starts, the hot pods generate sustained read traffic for hours or even days. Before it starts, throughput sits around 10–150 MB/s.
What we’ve observed so far:
The traffic appears to be mostly read traffic, not write traffic.
The duration seems related to max_primary_shard_size.
The smaller the shard size, the longer the sustained activity seems to continue.
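My rough mental model for why shard size would matter (assumptions mine, happy to be corrected): each rollover is followed by a shrink, and the shrink plus the shard relocation it forces re-read roughly the whole index from backend storage. A back-of-envelope sketch in Python, with the ingest rate (100 GB/day) as a made-up number:

```python
# Back-of-envelope only: the 100 GB/day ingest rate is my assumption;
# the shard counts/sizes come from our ILM policy below.

def rollovers_per_day(ingest_gb_per_day, shards, max_primary_shard_gb, max_age_days=1.0):
    # Rollover fires on whichever condition (size or age) is hit first.
    by_size = ingest_gb_per_day / (shards * max_primary_shard_gb)
    return max(by_size, 1.0 / max_age_days)

def shrink_read_gb_per_day(ingest_gb_per_day, shards, max_primary_shard_gb):
    n = rollovers_per_day(ingest_gb_per_day, shards, max_primary_shard_gb)
    index_gb = ingest_gb_per_day / n  # size of each rolled-over index
    # Assumption: relocating all shards to one node + the shrink itself
    # each re-read roughly 1x the index from backend storage.
    return n * index_gb * 2

for shard_gb in (20, 10, 5):
    print(shard_gb, rollovers_per_day(100, 2, shard_gb),
          shrink_read_gb_per_day(100, 2, shard_gb))
```

If this model is right, the total re-read volume per day is about the same regardless of shard size (roughly 2x ingest), but a 5 GB cap gives 10 rollover/shrink cycles per day instead of 2.5 at 20 GB, i.e. more, smaller bursts spread across the day, which could look like longer sustained activity.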
Current config:
```yaml
eck-apps:
  enabled: true
  elasticsearch:
    externalLoggingEnabled: false
    master:
      replicas: 2
      storage: 6Gi
      memory: 6Gi
    dataHot:
      replicas: 1
      storage: 50Gi
      memory: 8Gi
      cpu: 1500m
    client:
      replicas: 2
      storage: 6Gi
      memory: 2Gi
    clientExternal:
      replicas: 2
      memory: 4Gi
  logging:
    number_of_shards: 2
    number_of_replicas: 0
    rollover:
      max_age: "1d"
      max_primary_shard_size: "20GB"
    shrink:
      number_of_shards: 1
    delete:
      min_age: "1d"
```
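For anyone suggesting specific checks: this is the kind of thing I can run against the cluster while the read traffic is happening (assumes a port-forward to localhost:9200 and credentials in $ES_AUTH; adjust for your setup). Happy to post the output.

```shell
# Which ILM step is each index currently in?
curl -sk -u "$ES_AUTH" "https://localhost:9200/_all/_ilm/explain?pretty"

# Any shard relocations/recoveries in flight (shrink relocates shards first)?
curl -sk -u "$ES_AUTH" "https://localhost:9200/_cat/recovery?active_only=true&v"

# Cumulative merge I/O per node -- large totals point at segment merges.
curl -sk -u "$ES_AUTH" "https://localhost:9200/_nodes/stats/indices/merges?pretty"

# Currently running resize (shrink shows up as resize) or forcemerge tasks.
curl -sk -u "$ES_AUTH" "https://localhost:9200/_tasks?detailed=true&actions=*resize*,*forcemerge*&pretty"
```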
My questions are:
Has anyone else seen this kind of behavior?
Is there a known trigger for this kind of long-running sustained read activity?
Does this sound like it could be tied to ILM rollover/shrink, segment merges, or shard relocation?
Why would the traffic be so heavily read-oriented, especially against backend storage?
Note: this still happens when I run with only 1 hot pod and number_of_replicas: 0.
Any insight would be appreciated.
