Issue with Inaccurate CPU/Memory Stats Visualization in ELK for QA Environment

In our QA environment, we observed that both CPU and memory usage appear as flat lines on ELK, even under load. We would expect these metrics to rise as load is applied, but they remain static, and we are unsure whether this is caused by ELK’s aggregation logic or by a misinterpretation on our side of how these metrics should be handled.

We suspect that ELK’s default visualization settings and aggregation methods may be responsible. Since ELK controls the aggregations and formulas applied to these metrics for display, we have neither visibility into nor control over the underlying configuration. We export host metrics to ELK via OpenTelemetry and can confirm, from the OpenTelemetry Collector logs, that the data is generated and sent correctly. However, we cannot directly modify ELK’s graph visualization logic to influence how the metrics appear.

Observations:

  1. ELK does not allow configuring the names or formulas for metrics displayed in these default visualizations.
  2. In the Metrics Explorer tab, using KQL to filter metrics by names such as system.memory.usage and system.cpu.utilization provides expected dynamic visualizations.
  3. We have verified that host metrics data is generated and exported accurately by the OpenTelemetry Collector, but cannot control how ELK aggregates and displays these metrics (a rough cross-check query against the raw documents is sketched after this list).
  4. Version Differences:
  • QA Environment: ELK v8.13.4
  • Dev Environment: ELK v8.13.2
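
For completeness, a cross-check of the raw documents directly in Elasticsearch could look roughly like the sketch below. This is only a minimal sketch: the endpoint, the `metrics-*` index pattern, and the field names `system.memory.utilization` / `attributes.state` are assumptions and would need to be adjusted to the actual mapping.

```python
# Hypothetical cross-check of the raw metric documents stored in Elasticsearch,
# independent of the built-in Kibana visualizations.
# The index pattern and field names below are assumptions, not confirmed mappings.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

resp = es.search(
    index="metrics-*",  # assumed data stream pattern
    size=0,
    query={
        "bool": {
            "filter": [
                {"exists": {"field": "system.memory.utilization"}},  # assumed field name
                {"term": {"attributes.state": "used"}},              # assumed field name
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    aggs={"avg_used": {"avg": {"field": "system.memory.utilization"}}},
)
print("avg utilization (state=used):", resp["aggregations"]["avg_used"]["value"])
```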

Attachments: We have included screenshots of both CPU and memory usage graphs, along with relevant portions of OpenTelemetry Collector logs.

Questions:

  1. Is this static visualization behavior a known issue, or could it be a version-specific problem related to differences between ELK v8.13.2 (Dev) and ELK v8.13.4 (QA)?
  2. Could ELK’s aggregation method or visualization approach cause the discrepancy in average CPU and memory metrics?
  3. Are we missing any configuration steps or misinterpreting the data, especially in how average CPU and memory metrics are expected to be calculated or displayed?

Thank you for your assistance in helping us understand if this is a configuration oversight on our part or a visualization limitation within ELK. We look forward to any insights you might have regarding this issue.

An example of the collector logs is below:
Metric #2
Descriptor:
-> Name: system.memory.usage
-> Description: System memory usage
-> Unit: bytes
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> state: Str(used)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 6385209344
NumberDataPoints #1
Data point attributes:
-> state: Str(free)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 19615633408
NumberDataPoints #2
Data point attributes:
-> state: Str(cached)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 6845730816
Metric #3
Descriptor:
-> Name: system.memory.utilization
-> Description: System memory utilization
-> Unit: 1
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> state: Str(used)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 0.189738
NumberDataPoints #1
Data point attributes:
-> state: Str(free)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 0.582884
NumberDataPoints #2
Data point attributes:
-> state: Str(cached)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2024-10-28 11:33:39.749842736 +0000 UTC
Value: 0.203423

The average memory utilization, when calculated manually, comes out to between 25% and 27%. On ELK, however, it is around 32–33%, which seems incorrect.
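
For illustration, and only as an assumption on our part about how the aggregation might work: using the single snapshot from the collector log above, averaging the utilization values across all `state` dimensions lands very close to the 32–33% shown on ELK, whereas the `used` state alone is roughly 19%.

```python
# Per-state memory utilization values taken from the collector log excerpt above.
utilization = {"used": 0.189738, "free": 0.582884, "cached": 0.203423}

# Considering only the "used" state gives roughly 19%.
print("used only:        {:.1%}".format(utilization["used"]))

# Averaging across all state dimensions (i.e. ignoring the `state` attribute
# in the aggregation) gives roughly 32.5%, close to the 32-33% shown on ELK.
print("mean over states: {:.1%}".format(sum(utilization.values()) / len(utilization)))
```

If the built-in visualization does average over the `state` attribute rather than filtering to `state: used`, that could explain the discrepancy, but we have not been able to confirm this from the visualization configuration.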