I believe that I found a bug in the Prometheus module of Metricbeat. Consider the following Prometheus metric of type summary:
# HELP cassandra_client_request_latency_seconds Request latency.
# TYPE cassandra_client_request_latency_seconds summary
cassandra_client_request_latency_seconds_sum{operation="write",consistency="TWO",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds_count{operation="write",consistency="TWO",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.5",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.75",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.95",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.98",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.99",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.999",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
The Prometheus module correctly generates an event for each quantile of cassandra_client_request_latency_seconds, adding the correct quantile label to each event. However, it incorrectly adds cassandra_client_request_latency_seconds_count and cassandra_client_request_latency_seconds_sum to the event for the last quantile, even though these two values are not associated with any quantile.
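To make the symptom concrete, here is a rough sketch of the label sets involved for the exposition above (label names only; this is my own illustration, not the exact Metricbeat event schema):

// Illustration of the symptom only, not actual Metricbeat output: the
// labels attached to the _count/_sum values pick up the quantile label
// of whichever quantile sample was parsed last (here 0.999).
package main

import "fmt"

func main() {
    expectedCountSumLabels := map[string]string{
        "operation":   "write",
        "consistency": "TWO",
        // ... plus the cassandra_* labels, but no quantile label
    }
    actualCountSumLabels := map[string]string{
        "operation":   "write",
        "consistency": "TWO",
        // ... plus the cassandra_* labels
        "quantile": "0.999", // leaked from the last quantile sample
    }
    fmt.Println(expectedCountSumLabels)
    fmt.Println(actualCountSumLabels)
}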
I inspected the relevant code and believe that the origin of the bug is in metricbeat/helper/prometheus/textparse.go, specifically in the ParseMetricFamilies function.
The code for processing the labels looks like this:
var lbls strings.Builder
lbls.Grow(len(mets))
var labelPairs = []*labels.Label{}
var qv string // value of le or quantile label
for _, l := range lset.Copy() {
    if l.Name == labels.MetricName {
        continue
    }
    if l.Name == model.QuantileLabel {
        qv = lset.Get(model.QuantileLabel)
    } else if l.Name == labels.BucketLabel {
        qv = lset.Get(labels.BucketLabel)
    } else {
        lbls.WriteString(l.Name)
        lbls.WriteString(l.Value)
    }
    n := l.Name
    v := l.Value
    labelPairs = append(labelPairs, &labels.Label{
        Name:  n,
        Value: v,
    })
}
lbls is used to find the correct metric in summaryMetricName(…), and the quantile label is correctly excluded there, so that the count, the sum, and the values for all quantiles are associated with the same metric (a short standalone sketch of this grouping follows the next snippet). However, the quantile label is not excluded when generating labelPairs. Later in the code, labelPairs is used in the following way when processing a metric of type summary:
case model.MetricTypeSummary:
    lookupMetricName, metric = summaryMetricName(metricName, v, qv, lbls.String(), summariesByName)
    metric.Label = labelPairs
    if !isSum(metricName) {
        // Avoid registering the metric multiple times.
        continue
    }
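Before going on, here is a simplified, self-contained sketch (my own illustration, not the real summaryMetricName/summariesByName code) of why the count, the sum, and every quantile sample resolve to the same shared metric in the first place: quantile and le are never written into lbls, so all of these samples produce the same lookup key.

// Simplified sketch of the lbls grouping key (illustration only):
// quantile/le are excluded from the key, so every sample of one
// summary resolves to the same shared entry.
package main

import (
    "fmt"
    "strings"
)

type label struct{ name, value string }

func groupKey(lset []label) string {
    var lbls strings.Builder
    for _, l := range lset {
        if l.name == "__name__" || l.name == "quantile" || l.name == "le" {
            continue
        }
        lbls.WriteString(l.name)
        lbls.WriteString(l.value)
    }
    return lbls.String()
}

func main() {
    count := []label{{"__name__", "latency_seconds_count"}, {"operation", "write"}}
    q999 := []label{{"__name__", "latency_seconds"}, {"operation", "write"}, {"quantile", "0.999"}}
    fmt.Println(groupKey(count) == groupKey(q999)) // true: same shared metric
}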
So metric.Label is updated with the most recent labelPairs, but because the count, the sum, and the values of all quantiles share the same metric, and thus the same metric.Label, all but one of them end up with an incorrect label.
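The effect can be reproduced with a few lines of plain Go (a minimal sketch of the data flow only, not the Beats code): every sample overwrites the Label field of the one shared metric, so whichever sample happens to be processed last determines the labels that all of them end up with, which is also why the order of the samples in the scraped input matters.

// Minimal sketch of the last-writer-wins behavior on the shared metric.
package main

import "fmt"

type label struct{ Name, Value string }
type metric struct{ Label []*label }

func main() {
    shared := &metric{} // the single metric that all summary samples resolve to
    samplesLabels := [][]*label{
        {{"operation", "write"}},                        // _count
        {{"operation", "write"}},                        // _sum
        {{"operation", "write"}, {"quantile", "0.5"}},   // quantile 0.5
        {{"operation", "write"}, {"quantile", "0.999"}}, // quantile 0.999, parsed last
    }
    for _, lp := range samplesLabels {
        shared.Label = lp // each assignment overwrites the previous one
    }
    for _, l := range shared.Label {
        fmt.Printf("%s=%s\n", l.Name, l.Value) // prints operation=write and quantile=0.999
    }
}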
For the quantile values, this behavior does not cause any problems, because the GeneratePromEvents(…) function in metricbeat/module/prometheus/collector/data.go overwrites the quantile label for the quantile values. But it does not remove such a label (if present) for the count and sum values, so they end up with the incorrect label.
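Roughly, the asymmetry plays out like this (a hand-written sketch of the behavior described above, not the actual data.go code): the quantile events get their quantile label set from the sample itself, while nothing strips a stale quantile key from the labels of the count and sum events.

// Sketch of the asymmetry only; not the real GeneratePromEvents code.
package main

import "fmt"

func eventLabels(sampleQuantile string, metricLabels map[string]string) map[string]string {
    out := make(map[string]string, len(metricLabels))
    for k, v := range metricLabels {
        out[k] = v
    }
    if sampleQuantile != "" {
        out["quantile"] = sampleQuantile // quantile events: any stale value is overwritten
    }
    // count/sum events: nothing removes a stale "quantile" key here
    return out
}

func main() {
    stale := map[string]string{"operation": "write", "quantile": "0.999"} // the shared labelPairs
    fmt.Println(eventLabels("0.5", stale)) // quantile event: quantile=0.5, correct
    fmt.Println(eventLabels("", stale))    // count/sum event: quantile=0.999 leaks through
}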
This bug only appears when one of the quantile values comes last in the input from Prometheus. If the _count or _sum value comes last, the labels from that value are used, and thus the quantile label is not included.
I believe that this bug can be fixed quite easily. The code that processes the labels (as mentioned earlier) simply has to be changed to not add the quantile label (or le in the case of histograms) to labelPairs. The corrected code would look like this:
var lbls strings.Builder
lbls.Grow(len(mets))
var labelPairs = []*labels.Label{}
var qv string // value of le or quantile label
for _, l := range lset.Copy() {
    if l.Name == labels.MetricName {
        continue
    }
    if l.Name == model.QuantileLabel {
        qv = lset.Get(model.QuantileLabel)
    } else if l.Name == labels.BucketLabel {
        qv = lset.Get(labels.BucketLabel)
    } else {
        lbls.WriteString(l.Name)
        lbls.WriteString(l.Value)
        // The append now only runs for labels other than quantile/le,
        // so these never end up in labelPairs.
        n := l.Name
        v := l.Value
        labelPairs = append(labelPairs, &labels.Label{
            Name:  n,
            Value: v,
        })
    }
}
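For what it is worth, the filtering rule behind this fix can also be expressed as a small standalone helper (my own sketch using the usual Prometheus label packages, not code from Beats), which might be a convenient shape for a unit test:

// Standalone sketch of the filtering rule behind the proposed fix:
// keep every label except __name__, quantile, and le.
package main

import (
    "fmt"

    "github.com/prometheus/common/model"
    "github.com/prometheus/prometheus/model/labels"
)

func filterLabelPairs(in []labels.Label) []*labels.Label {
    out := make([]*labels.Label, 0, len(in))
    for _, l := range in {
        if l.Name == labels.MetricName || l.Name == model.QuantileLabel || l.Name == labels.BucketLabel {
            continue
        }
        l := l // copy before taking the address
        out = append(out, &l)
    }
    return out
}

func main() {
    in := []labels.Label{
        {Name: labels.MetricName, Value: "cassandra_client_request_latency_seconds"},
        {Name: "operation", Value: "write"},
        {Name: model.QuantileLabel, Value: "0.999"},
    }
    for _, l := range filterLabelPairs(in) {
        fmt.Printf("%s=%s\n", l.Name, l.Value) // only operation=write survives
    }
}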
Can somebody please confirm this bug, so that I can open an issue in the Beats project on GitHub?