Hello there,
Recently we've begun tinkering with a peculiar dataset named endpoint.metrics, which reports a number of performance metrics once a day (it seems) and upon every policy change.
For what it's worth, we are using version 8.14 of the stack (agent version 8.14.1 currently).
There does not seem to be much information about this dataset available on the Internet, yet it has been extremely useful for debugging a few nasty performance issues we've encountered on a small number of assets within our distributed fleet. For example, in one specific case it allowed us to surmise that the excessive system load we were experiencing on a Postgres server was caused by an enormous number of authentication attempts. That prompted us to temporarily activate additional application logging, which allowed us to capture events about a missing Postgres role, which... You get the idea: it's an extremely useful source of information when trying to understand where things are going wrong.
But we do not currently understand the actual meaning of some of those fields. We have only deduced from experimental evidence that "specific number goes up → performance goes down → bad things", which isn't really an ideal workflow for troubleshooting.
We have some doubts concerning the meaning of some of those metrics. Namely:
- Do the fields week_ms and week_idle_ms represent the total time, in milliseconds, the agent spent active, either doing something (week_ms) or waiting for something (week_idle_ms), over the last 7-day window?
- How should the cpu.consumption metrics be interpreted? Are those numbers within the array an average? A mean? The current consumption? Are these numbers absolute values to be interpreted in some particular way, or just a series of percentages?
- Does the field Endpoint.metrics.documents_volume.*.suppressed_count represent the number of documents that the agent has produced locally but not forwarded to the central SIEM for whatever reason (e.g. event filters, or simply the fact that a particular telemetry collection is specifically deactivated)?
- The Endpoint.metrics.cpu.endpoint.histogram.counts field, as well as its adjacent values field, always seem to contain 20 entries. What is the correct way to interpret those fields? The values do resemble a set of "percentage steps", going by 5, but it's unclear to me what those values specifically represent.
- Concerning the Endpoint.metrics.threads array, the general meaning seems clear to me. I'm just wondering what the correct way to interpret the inner mean field is: does it represent the mean absolute percentage of CPU time used by the individual feature with respect to the global total (e.g. 0.0003 = 0.0003% of the total system CPU time), or is there another way to interpret those numbers?
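To make the histogram question more concrete, here is a minimal Python sketch of how we would read the counts and values arrays if our guess turns out to be right. It assumes (purely our assumption, not something we found documented) that values[i] is a CPU percentage bucket and counts[i] is the number of samples that fell into that bucket. The function name summarize_histogram and the sample numbers are made up for illustration and do not come from a real document.

```python
def summarize_histogram(values, counts):
    """Pair each bucket with its count and estimate a weighted mean.

    ASSUMPTION (unconfirmed): values[i] is a CPU percentage bucket and
    counts[i] is the number of samples recorded in that bucket.
    """
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")

    total = sum(counts)
    if total == 0:
        return {"total_samples": 0, "weighted_mean": None, "busiest_bucket": None}

    # Weight each bucket value by how many samples landed in it.
    weighted_mean = sum(v * c for v, c in zip(values, counts)) / total
    # The bucket that collected the most samples.
    busiest_bucket = max(zip(values, counts), key=lambda pair: pair[1])[0]

    return {
        "total_samples": total,
        "weighted_mean": weighted_mean,
        "busiest_bucket": busiest_bucket,
    }


# Made-up sample: 20 buckets in steps of 5, almost all samples in the lowest one.
values = list(range(5, 105, 5))
counts = [90, 6, 3, 1] + [0] * 16

summary = summarize_histogram(values, counts)
print(summary)
```

If that reading is wrong (for instance, if the values are bucket boundaries rather than representative percentages), the weighted mean above would be meaningless, which is exactly why we'd like to know the intended interpretation.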
Is there any specific documentation I might have missed, aside from the one present within the Endpoint GitHub repository?
Thanks for your help and your work!