Information about the endpoint.metrics dataset

Hello there,

Recently we've begun tinkering with a peculiar dataset named endpoint.metrics, which reports a number of performance metrics once a day (it seems) and upon every policy change.

For what it's worth, we are using version 8.14 of the stack (agent version 8.14.1 currently).

There does not seem to be much information about this dataset available online, yet it has been very useful for debugging a few nasty performance issues that we've encountered on a small number of assets within our distributed fleet. For example, in one specific case it allowed us to surmise that the excessive system load we were experiencing on a Postgres server was caused by an enormous number of authentication attempts. That prompted us to temporarily enable additional application logging, which in turn let us capture events about a missing Postgres role, which... You get the idea: it's an extremely useful source of information when trying to understand where things are going wrong.

However, we do not yet understand what some of those fields actually mean. We have only deduced from experimental evidence that "specific number goes up → performance goes down → bad things", which isn't an ideal workflow for troubleshooting.

Specifically, we have doubts about the meaning of the following metrics:

  1. Do the fields week_ms and week_idle_ms represent the total time, in milliseconds, that the agent spent active, either doing something (week_ms) or waiting for something (week_idle_ms), over the last 7-day window?
  2. How should the cpu.consumption metrics be interpreted? Are the numbers in the array averages of some kind, or the current consumption? Are they absolute values to be interpreted in some particular way, or just a series of percentages?
  3. Does the field Endpoint.metrics.documents_volume.*.suppressed_count represent the number of documents that the agent has produced locally but not forwarded to the central SIEM for whatever reason (e.g. event filters, or simply the fact that a particular telemetry collection is deactivated)?
  4. The Endpoint.metrics.cpu.endpoint.histogram.counts field and its adjacent values field always seem to contain 20 entries each. What is the correct way to interpret those fields? The values do resemble a set of "percentage steps" in increments of 5, but it's unclear to me what they specifically represent.
  5. Concerning the Endpoint.metrics.threads array, the general meaning seems clear to me. I'm just wondering how to interpret the inner mean field: does it represent the mean percentage of total system CPU time used by the individual feature (e.g. 0.0003 = 0.0003% of the total system CPU time), or is there another way to interpret those numbers?
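
To make question 4 concrete, here is a minimal Python sketch of how we currently read the two histogram arrays. It assumes (and this is exactly what we'd like to have confirmed) that counts and values are parallel arrays and that each entry in values labels a CPU-percentage bucket. The function name and the sample numbers are made up for illustration and are not taken from a real document.

```python
def summarize_histogram(values, counts):
    """Pair each bucket label with its count and its share of all samples.

    Assumption (unconfirmed): values[i] labels the bucket that counts[i]
    refers to, i.e. the two arrays are parallel.
    """
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    total = sum(counts)
    rows = []
    for bucket, count in zip(values, counts):
        share = count / total if total else 0.0
        rows.append((bucket, count, share))
    return rows


# Hypothetical data: 20 buckets in steps of 5, with invented counts.
values = [5 * i for i in range(1, 21)]
counts = [90, 6, 3, 1] + [0] * 16

for bucket, count, share in summarize_histogram(values, counts):
    if count:  # skip empty buckets
        print(f"bucket {bucket:>3}: {count} samples ({share:.1%})")
```

If this reading is right, a histogram whose counts pile up in the low buckets would mean the endpoint process was mostly idle, but whether each value is a lower bound, an upper bound, or a midpoint is part of what we're asking.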

Is there any specific documentation I might have missed, aside from what is in the Endpoint GitHub repository?

Thanks for your help and your work!
