Information about the endpoint.metrics dataset

popeio · July 26, 2024, 3:23pm

Hello there,

Recently we've begun tinkering with a peculiar dataset named endpoint.metrics, which reports a number of performance metrics once a day (it seems) and upon every policy change.

For what it's worth, we are using version 8.14 of the stack (agent version 8.14.1 currently).

There does not seem to be much information about this dataset available online, yet it has proved very useful for debugging a few nasty performance issues on a small number of assets within our distributed fleet. For example, in one specific case it allowed us to surmise that the excessive system load on a Postgres server was caused by an enormous number of authentication attempts. That prompted us to temporarily enable additional application logging, which in turn captured events about a missing Postgres role, which... You get the idea: it's an extremely useful source of information when trying to understand where things are going wrong.

But we do not currently understand the actual meaning of some of those fields. We have only deduced from experimental evidence that "specific number goes up → performance goes down → bad things", which isn't an ideal workflow for troubleshooting.

We currently have some doubts concerning the meaning of some of those metrics. Namely:

Do the fields week_ms and week_idle_ms represent the total time, in milliseconds, the agent spent active, either doing something (week_ms) or waiting for something (week_idle_ms), over the last 7-days window?
How should the cpu.consumption metrics be interpreted? Are those numbers within the array an average? A mean? The current consumption? Are these numbers absolute values to be interpreted in some particular way, or just a series of percentages?
Does the field Endpoint.metrics.documents_volume.*.suppressed_count represent the number of documents that the agent has produced locally but not forwarded to the central SIEM for whatever reason (e.g. event filters, or simply the fact that a particular telemetry collection is deactivated)?
The Endpoint.metrics.cpu.endpoint.histogram.counts field, as well as its adjacent values field, always seems to contain 20 entries. What is the correct way to interpret those fields? The values do resemble a set of "percentage steps", going up by 5, but it's unclear to me what they specifically represent.
Concerning the Endpoint.metrics.threads array, the general meaning seems clear to me. I'm just wondering how to interpret the inner mean field: does it represent the mean percentage of CPU time used by the individual feature relative to the whole system (e.g. 0.0003 = 0.0003% of the total system CPU time), or is there another way to interpret those numbers?
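
To make the histogram question more concrete, here is a minimal Python sketch of how we have been tentatively reading the counts/values pair. It rests entirely on our own unconfirmed assumption that each entry in values is the upper bound (in percent) of a 5%-wide bucket and that the matching entry in counts is the number of CPU samples that fell into that bucket. The function name and the sample numbers are made up for illustration and do not come from real agent output.

```python
def summarize_histogram(values, counts):
    """Approximate the mean and 95th-percentile CPU usage from a bucketed histogram.

    ASSUMPTION (not confirmed by any documentation): values[i] is the upper
    bound, in percent, of bucket i, and counts[i] is the number of samples
    observed in that bucket.
    """
    if len(values) != len(counts):
        raise ValueError("values and counts must have the same length")
    total = sum(counts)
    if total == 0:
        return None  # no samples recorded, nothing to summarize

    # Bucket width inferred from the spacing of the bounds (5 in our data).
    width = values[1] - values[0] if len(values) > 1 else values[0]

    # Use each bucket's midpoint as a stand-in for the samples inside it.
    approx_mean = sum((v - width / 2) * c for v, c in zip(values, counts)) / total

    # Smallest upper bound below which at least 95% of the samples fall.
    running = 0
    p95_bound = values[-1]
    for v, c in zip(values, counts):
        running += c
        if running * 100 >= 95 * total:
            p95_bound = v
            break

    return {
        "samples": total,
        "approx_mean_pct": approx_mean,
        "p95_upper_bound_pct": p95_bound,
    }


# Hypothetical example: 20 buckets from 5% to 100%, most samples near idle.
example_values = list(range(5, 101, 5))
example_counts = [90, 5, 3, 2] + [0] * 16
summary = summarize_histogram(example_values, example_counts)
```

If the values turn out to be lower bounds, or the counts are something other than raw sample counts, the midpoint and percentile logic above would need to be adjusted accordingly.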

Is there any specific documentation I might have missed, aside from the one present within the Endpoint GitHub repository?

Thanks for your help and your work!


