I believe that I found a bug in the Prometheus module of Metricbeat. Consider the following Prometheus metric of type summary:
# HELP cassandra_client_request_latency_seconds Request latency.
# TYPE cassandra_client_request_latency_seconds summary
cassandra_client_request_latency_seconds_sum{operation="write",consistency="TWO",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds_count{operation="write",consistency="TWO",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.5",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.75",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.95",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.98",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.99",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
cassandra_client_request_latency_seconds{operation="write",consistency="TWO",quantile="0.999",cassandra_cluster="Test Cluster",cassandra_node="127.0.0.1",cassandra_datacenter="datacenter1",cassandra_rack="rack1"} 0.0 1737149678230
The Prometheus module correctly generates an event for each quantile of cassandra_client_request_latency_seconds, adding the correct quantile label to each event. However, it incorrectly adds cassandra_client_request_latency_seconds_count and cassandra_client_request_latency_seconds_sum to the event for the last quantile, even though these two values are not associated with any quantile.
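To make the symptom concrete, here is a rough sketch of the label sets involved for the exposition above (label names only; this is my own illustration, not the exact Metricbeat event schema):

// Illustration of the symptom only, not actual Metricbeat output: the
// labels attached to the _count/_sum values pick up the quantile label
// of whichever quantile sample was parsed last (here 0.999).
package main

import "fmt"

func main() {
    expectedCountSumLabels := map[string]string{
        "operation":   "write",
        "consistency": "TWO",
        // ... plus the cassandra_* labels, but no quantile label
    }
    actualCountSumLabels := map[string]string{
        "operation":   "write",
        "consistency": "TWO",
        // ... plus the cassandra_* labels
        "quantile": "0.999", // leaked from the last quantile sample
    }
    fmt.Println(expectedCountSumLabels)
    fmt.Println(actualCountSumLabels)
}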
I inspected the relevant code and believe that the origin of the bug is in metricbeat/helper/prometheus/textparse.go, specifically in the ParseMetricFamilies function.
The code for processing the labels looks like this:
var lbls strings.Builder
lbls.Grow(len(mets))
var labelPairs = []*labels.Label{}
var qv string // value of le or quantile label
for _, l := range lset.Copy() {
    if l.Name == labels.MetricName {
        continue
    }
    if l.Name == model.QuantileLabel {
        qv = lset.Get(model.QuantileLabel)
    } else if l.Name == labels.BucketLabel {
        qv = lset.Get(labels.BucketLabel)
    } else {
        lbls.WriteString(l.Name)
        lbls.WriteString(l.Value)
    }
    n := l.Name
    v := l.Value
    labelPairs = append(labelPairs, &labels.Label{
        Name:  n,
        Value: v,
    })
}
lbls is used to find the correct metric in summaryMetricName(…), and the quantile label is correctly excluded there, so that the count, the sum, and the values for all quantiles are associated with the same metric (a short standalone sketch of this grouping follows the next snippet). However, the quantile label is not excluded when generating labelPairs. Later in the code, labelPairs is used in the following way when processing a metric of type summary:
case model.MetricTypeSummary:
    lookupMetricName, metric = summaryMetricName(metricName, v, qv, lbls.String(), summariesByName)
    metric.Label = labelPairs
    if !isSum(metricName) {
        // Avoid registering the metric multiple times.
        continue
    }
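Before going on, here is a simplified, self-contained sketch (my own illustration, not the real summaryMetricName/summariesByName code) of why the count, the sum, and every quantile sample resolve to the same shared metric in the first place: quantile and le are never written into lbls, so all of these samples produce the same lookup key.

// Simplified sketch of the lbls grouping key (illustration only):
// quantile/le are excluded from the key, so every sample of one
// summary resolves to the same shared entry.
package main

import (
    "fmt"
    "strings"
)

type label struct{ name, value string }

func groupKey(lset []label) string {
    var lbls strings.Builder
    for _, l := range lset {
        if l.name == "__name__" || l.name == "quantile" || l.name == "le" {
            continue
        }
        lbls.WriteString(l.name)
        lbls.WriteString(l.value)
    }
    return lbls.String()
}

func main() {
    count := []label{{"__name__", "latency_seconds_count"}, {"operation", "write"}}
    q999 := []label{{"__name__", "latency_seconds"}, {"operation", "write"}, {"quantile", "0.999"}}
    fmt.Println(groupKey(count) == groupKey(q999)) // true: same shared metric
}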
So metric.Label is updated with the most recent labelPairs, but because the count, the sum, and the values of all quantiles share the same metric, and thus the same metric.Label, all but one of them end up with an incorrect label.
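The effect can be reproduced with a few lines of plain Go (a minimal sketch of the data flow only, not the Beats code): every sample overwrites the Label field of the one shared metric, so whichever sample happens to be processed last determines the labels that all of them end up with, which is also why the order of the samples in the scraped input matters.

// Minimal sketch of the last-writer-wins behavior on the shared metric.
package main

import "fmt"

type label struct{ Name, Value string }
type metric struct{ Label []*label }

func main() {
    shared := &metric{} // the single metric that all summary samples resolve to
    samplesLabels := [][]*label{
        {{"operation", "write"}},                        // _count
        {{"operation", "write"}},                        // _sum
        {{"operation", "write"}, {"quantile", "0.5"}},   // quantile 0.5
        {{"operation", "write"}, {"quantile", "0.999"}}, // quantile 0.999, parsed last
    }
    for _, lp := range samplesLabels {
        shared.Label = lp // each assignment overwrites the previous one
    }
    for _, l := range shared.Label {
        fmt.Printf("%s=%s\n", l.Name, l.Value) // prints operation=write and quantile=0.999
    }
}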
For the quantile values, this behavior does not cause any problems, because the GeneratePromEvents(…) function in metricbeat/module/prometheus/collector/data.go overwrites the quantile label for the quantile values. But it does not remove such a label (if present) for the count and sum values, so they end up with the incorrect label.
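Roughly, the asymmetry plays out like this (a hand-written sketch of the behavior described above, not the actual data.go code): the quantile events get their quantile label set from the sample itself, while nothing strips a stale quantile key from the labels of the count and sum events.

// Sketch of the asymmetry only; not the real GeneratePromEvents code.
package main

import "fmt"

func eventLabels(sampleQuantile string, metricLabels map[string]string) map[string]string {
    out := make(map[string]string, len(metricLabels))
    for k, v := range metricLabels {
        out[k] = v
    }
    if sampleQuantile != "" {
        out["quantile"] = sampleQuantile // quantile events: any stale value is overwritten
    }
    // count/sum events: nothing removes a stale "quantile" key here
    return out
}

func main() {
    stale := map[string]string{"operation": "write", "quantile": "0.999"} // the shared labelPairs
    fmt.Println(eventLabels("0.5", stale)) // quantile event: quantile=0.5, correct
    fmt.Println(eventLabels("", stale))    // count/sum event: quantile=0.999 leaks through
}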
This bug only appears when one of the quantile values comes last in the input from Prometheus. If the _count or _sum value comes last, the labels from that value are used, and thus the quantile label is not included.
I believe that this bug can be fixed quite easily. The code that processes the labels (as mentioned earlier) simply has to be changed to not add the quantile label (or le in the case of histograms) to labelPairs. The corrected code would look like this:
var lbls strings.Builder
lbls.Grow(len(mets))
var labelPairs = []*labels.Label{}
var qv string // value of le or quantile label
for _, l := range lset.Copy() {
    if l.Name == labels.MetricName {
        continue
    }
    if l.Name == model.QuantileLabel {
        qv = lset.Get(model.QuantileLabel)
    } else if l.Name == labels.BucketLabel {
        qv = lset.Get(labels.BucketLabel)
    } else {
        lbls.WriteString(l.Name)
        lbls.WriteString(l.Value)
        // The append now only runs for labels other than quantile/le,
        // so these never end up in labelPairs.
        n := l.Name
        v := l.Value
        labelPairs = append(labelPairs, &labels.Label{
            Name:  n,
            Value: v,
        })
    }
}
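For what it is worth, the filtering rule behind this fix can also be expressed as a small standalone helper (my own sketch using the usual Prometheus label packages, not code from Beats), which might be a convenient shape for a unit test:

// Standalone sketch of the filtering rule behind the proposed fix:
// keep every label except __name__, quantile, and le.
package main

import (
    "fmt"

    "github.com/prometheus/common/model"
    "github.com/prometheus/prometheus/model/labels"
)

func filterLabelPairs(in []labels.Label) []*labels.Label {
    out := make([]*labels.Label, 0, len(in))
    for _, l := range in {
        if l.Name == labels.MetricName || l.Name == model.QuantileLabel || l.Name == labels.BucketLabel {
            continue
        }
        l := l // copy before taking the address
        out = append(out, &l)
    }
    return out
}

func main() {
    in := []labels.Label{
        {Name: labels.MetricName, Value: "cassandra_client_request_latency_seconds"},
        {Name: "operation", Value: "write"},
        {Name: model.QuantileLabel, Value: "0.999"},
    }
    for _, l := range filterLabelPairs(in) {
        fmt.Printf("%s=%s\n", l.Name, l.Value) // only operation=write survives
    }
}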
Can somebody please confirm this bug, so that I can open an issue in the Beats project on GitHub?