OTEL - Prometheus receiver and missing metrics

Hi folks,

I'm using the prometheusreceiver in a Deployment to scrape our pods every 60s. The metrics are then sent via OTLP to another instance of the collector running as a DaemonSet, which then sends them on to Elastic Cloud. What I'm finding is that the metrics are not coming through at regular 60s intervals.

For example, if there are 3 pods, each minute between 0 and 3 records of a given metric are indexed, which suggests that they are being dropped at some point.
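A Dev Tools aggregation along these lines is one way to count what actually lands per minute. This is only a sketch: the data stream name is taken from the exporter error further down, and the pod-name field assumes the exporter's OTel mapping mode, so adjust both to match what your documents actually contain.

// data stream name from the exporter error below; the terms field assumes the OTel mapping mode
GET metrics-otel-default/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "per_pod": { "terms": { "field": "resource.attributes.k8s.pod.name" } }
      }
    }
  }
}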

In the collector that is performing the scrapes I can see that they are being performed regularly, so I'm starting to think that there is an issue in the gateway collector or at the Elastic Cloud end.

Is there a way to verify that the gateway is performing as expected, so that I can eliminate that part of the chain?

[cross-posted with CNCF Slack #otel-collector]

Ah, looking at the internal metrics on the gateway collector I can see that the otelcol_exporter_send_failed_metric_points counter is increasing all the time.

Possible root cause

2025-03-30T20:45:38.613Z        error   elasticsearchexporter@v0.112.0/bulkindexer.go:344       failed to index document        {"kind": "exporter", "data_type": "metrics", "name": "elasticsearch/otel", "index": ".ds-metrics-otel-default-2025.03.30-000026", "error.type": "version_conflict_engine_exception", "error.reason": "[<ID>][<ID>@2025-03-30T20:45:21.786Z]: version conflict, document already exists (current version [1])", "hint": "check the \"Known issues\" section of Elasticsearch Exporter docs"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter.flushBulkIndexer
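As I read the known issues entry, this is the TSDS duplicate-document case: two metric documents end up with the same timestamp and dimensions, so the second write is rejected as a duplicate. If you want to confirm the backing indices really are in time-series mode, the index settings show it (data stream name assumed from the error above):

// should return "index.mode": "time_series" for each backing index of a TSDS
GET metrics-otel-default/_settings/index.mode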

In the known issues for the exporter there is a requirement to be on Elasticsearch 8.16.5 or above.

So our cluster is on that version, and even adding the component template as recommended for older versions doesn't address this.

Removing the batch processor as recommended doesn't appear to address this either.

If you put in the component template, it will not take effect until the data stream rolls over, which could take some time.

You can try going to Kibana - Dev Tools and running:

POST metrics-otel-default/_rollover

When you do that, it will create a new backing index which uses the component template... that said, because this is a TSDS, the new metrics will not start to flow into that new backing index for ~30 mins (long explanation left out).
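If it helps, you can confirm the rollover from Dev Tools as well; the first request lists the backing indices of the data stream, and the second shows each backing index's index.time_series.start_time / end_time, which is where that ~30 minute window comes from (data stream name assumed from the error earlier in the thread):

// list backing indices, then inspect the TSDS time range settings per backing index
GET _data_stream/metrics-otel-default

GET metrics-otel-default/_settings/index.time_series.*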

Plus, you did not share your configuration from either collector...

And I see you are using version 0.112... I think the latest is 0.122.

Yes, the index has rolled over already; the ILM policy applied to this data stream is set to roll over daily.
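(For anyone else checking this, ILM explain accepts the data stream name directly and shows where each backing index sits in its lifecycle; the data stream name here is assumed from earlier in the thread.)

// per-backing-index ILM state, including the current phase and age
GET metrics-otel-default/_ilm/explain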

In my support case I shared all 3 relevant configs used by the in-cluster operator.

So I'm trying to follow the documented and supported versions... The generated values file that is passed into the Helm chart sets the image to docker.elastic.co/beats/elastic-agent:8.16.5, which matches the Elastic cluster version.

Does the otel/opentelemetry-collector-contrib:0.122.1 have the same distribution config as the elastic-agent?

Looks like the 8.16.x Elastic Agent uses v0.112.0 versioned components from the OpenTelemetry distribution.

In 8.17 this changes to v0.119.0.

This is a public forum, so we do not have access to your support case, but good, it sounds like you are working with support...

Might be worth a try...

And of course, as I am sure you know, Elastic Agent in OTel mode is still in technical preview...

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features

We absolutely appreciate you using / trying and reporting issues!

Happy to share here as well. Is it best to just paste it in full, or share a gist/pastebin?

Just to follow up, as I addressed this with the developer of the exporter:

Using otel-collector-contrib:0.122.0 or otel-collector-contrib:0.124.0 I'm now seeing no errors about duplicate metrics. Our hosted Elastic cluster is running 8.16.6.
Ultimately we needed to add a component template using:

PUT _component_template/metrics-otel@custom 
{
  "template": { 
    "mappings": {
      "properties": {
        "_metric_names_hash": { 
          "type": "keyword",
          "time_series_dimension": true 
        } 
      } 
    }
  }
}
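To sanity-check that the template was stored and would actually be composed into this data stream, something like the following should work; the simulate call just shows the mappings a new backing index would get (data stream name taken from the rollover below):

// confirm the component template exists
GET _component_template/metrics-otel@custom

// show the effective template a new backing index with this name would receive
POST _index_template/_simulate_index/metrics-prometheusreceiver.otel-default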

Then roll over the data stream to use a new index (docs don't get routed into the new index for ~30m) with:

POST metrics-prometheusreceiver.otel-default/_rollover

Although the _metric_names_hash is indexed dynamically, I think it'll be missing the time_series_dimension: true on that field.
Also, I think you've helped us discover a bug in releasing this _metric_names_hash workaround.
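One way to double-check what the field looks like on the current write index after the rollover is the field mapping API; it should show whether time_series_dimension made it in (data stream and field name taken from the posts above):

GET metrics-prometheusreceiver.otel-default/_mapping/field/_metric_names_hash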


Nice @Steve_Foster

Thank you ...

Looking at these docs...

Was this the fix?

Yes; however, our hosted cluster is on 8.16.6, so we shouldn't have needed the custom component template based on the wording of the conditions about when you need to add it.

I think this is what foxed Carson in the first instance because it looked like it should have been fine.
