Fresh install or upgraded from another version? Upgraded from 7.17.1 to 7.17.9.
Is there anything special in your setup? We use Kafka and Logstash, and we have our own index patterns and custom templates.
Description of the problem including expected versus actual behavior. Please include screenshots (if relevant): We have a problem with a very large amount of data being logged to .ds-metrics-apm.internal-default on our cluster. We have just transitioned from a standalone APM Server to data streams (Elastic Agent, Fleet-managed).
After that transition, this index (.ds-metrics-apm.internal-default) contains tons of documents with agent.name: dotnet.
For the last 24 h there were 238,228,987 hits.
We started observing this flood of data after we changed the Fleet settings to experimental: true. Since then, we have very high CPU usage on our nodes.
That's how we found so much data in .ds-metrics-apm.internal-default.
Sampling is not supported for the internal metrics.
The internal metrics power the APM UI by default from 8.0 onwards. The server basically aggregates raw events into metrics over a fixed time interval; the interval defaults to 1 minute in 7.17. The aggregated metrics are meant to act as a kind of rollup of the raw data, allowing raw data to be deleted earlier without losing historical key metrics such as TPM.
The question is why there is such a large number of aggregated metric events. The metrics are aggregated along several dimensions, and your services may be instrumented to set high-cardinality values where we do not expect them, e.g. a random ID as part of service.name. This would lead to hitting the aggregation bucket limits, and ultimately to issuing one metric event per APM event. (This behavior is improved in 8.x.)
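One quick way to spot-check for unexpected cardinality is to run an aggregation on the dimensions the aggregator groups by, for example service.name. A minimal Dev Tools sketch (the data stream name is the one from your description, and it assumes the standard APM field mappings, so adjust as needed):

```
GET metrics-apm.internal-default/_search
{
  "size": 0,
  "aggs": {
    "distinct_service_names": {
      "cardinality": { "field": "service.name" }
    },
    "top_service_names": {
      "terms": { "field": "service.name", "size": 20 }
    }
  }
}
```

If the cardinality is in the thousands, or the top terms contain generated IDs, that would point at the instrumentation setting high-cardinality values.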
Could you take a look at the apm-server logs and see if there are any errors logged?
After looking at this again, I actually realised that your screenshot shows span_breakdown metrics. Let me change my response accordingly:
These breakdown metrics are used to populate the Time spent by span type graph, see details. The metrics were introduced before 7.17 and are unrelated to the usage of Elastic Agent or the experimental output.
In 7.x the sampling rate is not applied to transaction breakdown metrics, as transaction documents are generally retained and sampling is only applied to span data, see What data is sampled?.
From 8.0 onwards, the sampling logic, and which data are retained versus which are calculated, have changed.
In summary, if you decide to remove the data in this index (e.g. via an ingest pipeline), the Time spent by span type visualization will no longer work.
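For reference, dropping those documents at ingest time could look roughly like the sketch below. This is not an official recipe: the pipeline name and the condition used to recognize breakdown metric documents (here, the presence of span.self_time) are assumptions you would need to verify against your own documents first.

```
PUT _ingest/pipeline/drop-span-breakdown-metrics
{
  "description": "Drop APM span breakdown metric documents (sketch)",
  "processors": [
    {
      "drop": {
        "if": "ctx.span?.self_time != null"
      }
    }
  ]
}
```

You would then still need to attach the pipeline to the data stream, e.g. via the index.default_pipeline or index.final_pipeline setting in a custom index template; how best to wire that up for a Fleet-managed data stream depends on your version.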
Thanks for all your answers.
We have this in our plans for this year, but we can't switch to 8.x just like that.
It is a whole process to verify that everything works OK on the lower-environment clusters, and only after that can we switch the prod cluster to 8.x.
We will try an ingest pipeline first, and maybe we will use our own "sampling" to keep only 10% of the data and drop the other 90%.
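If we go the "sampling" route, we are thinking of something along these lines (a rough sketch only, reusing the same hypothetical condition as above; it drops ~90% of matching documents at random):

```
PUT _ingest/pipeline/sample-span-breakdown-metrics
{
  "description": "Keep roughly 10% of APM span breakdown metric documents (sketch)",
  "processors": [
    {
      "drop": {
        "if": "ctx.span?.self_time != null && Math.random() >= 0.1"
      }
    }
  ]
}
```

We understand this will still affect the Time spent by span type graph mentioned above, since it would be computed from a random subset of the data.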