OTEL - Prometheus receiver and missing metrics

Hi folks,

I'm using the prometheusreceiver in a Deployment to scrape our pods every 60s. The metrics are then sent via OTLP to another instance of the collector running as a DaemonSet, which then sends them on to Elastic Cloud. What I'm finding is that the metrics are not coming through at regular 60s intervals.

For example, if there are 3 pods, each minute between 0 and 3 records of a given metric are indexed, which suggests that they are being dropped at some point.
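A Dev Tools aggregation along these lines is one way to count what actually lands per minute. This is only a sketch: the data stream name is taken from the exporter error further down, and the pod-name field assumes the exporter's OTel mapping mode, so adjust both to match what your documents actually contain.

// data stream name from the exporter error below; the terms field assumes the OTel mapping mode
GET metrics-otel-default/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } },
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "per_pod": { "terms": { "field": "resource.attributes.k8s.pod.name" } }
      }
    }
  }
}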

In the collector that is performing the scrapes I can see that they are being performed regularly, so I'm starting to think that there is an issue in the gateway collector or at the Elastic Cloud end.

Is there a way to verify that the gateway is performing as expected, so that I can eliminate that part of the chain?

[cross-posted with CNCF Slack #otel-collector]

Ah, looking at the internal metrics on the gateway collector I can see that the otelcol_exporter_send_failed_metric_points counter is increasing all the time.

Possible root cause

2025-03-30T20:45:38.613Z        error   elasticsearchexporter@v0.112.0/bulkindexer.go:344       failed to index document        {"kind": "exporter", "data_type": "metrics", "name": "elasticsearch/otel", "index": ".ds-metrics-otel-default-2025.03.30-000026", "error.type": "version_conflict_engine_exception", "error.reason": "[<ID>][<ID>@2025-03-30T20:45:21.786Z]: version conflict, document already exists (current version [1])", "hint": "check the \"Known issues\" section of Elasticsearch Exporter docs"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter.flushBulkIndexer
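As I read the known issues entry, this is the TSDS duplicate-document case: two metric documents end up with the same timestamp and dimensions, so the second write is rejected as a duplicate. If you want to confirm the backing indices really are in time-series mode, the index settings show it (data stream name assumed from the error above):

// should return "index.mode": "time_series" for each backing index of a TSDS
GET metrics-otel-default/_settings/index.mode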

In the known issues for the exporter there is a requirement to be on Elasticsearch 8.16.5 or above.

So our cluster is on that version, and even adding the component template as recommended for older versions doesn't address this.

Removing the batch processor as recommended doesn't appear to address this either.

If you put in the component template, it will not take effect until the data stream rolls over, which could take some time.

You can try going to Kibana - Dev Tools and running:

POST metrics-otel-default/_rollover

When you do that, it will create a new backing index which uses the component template... that said, because this is a TSDS, the new metrics will not start to flow into that new backing index for ~30 mins (long explanation left out).
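If it helps, you can confirm the rollover from Dev Tools as well; the first request lists the backing indices of the data stream, and the second shows each backing index's index.time_series.start_time / end_time, which is where that ~30 minute window comes from (data stream name assumed from the error earlier in the thread):

// list backing indices, then inspect the TSDS time range settings per backing index
GET _data_stream/metrics-otel-default

GET metrics-otel-default/_settings/index.time_series.*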

Plus, you did not share your configuration from either collector...

And I see you are using version 0.112... I think the latest is 0.122.

Yes, the index has rolled over already; the ILM policy applied to this data stream is set to roll over daily.
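(For anyone else checking this, ILM explain accepts the data stream name directly and shows where each backing index sits in its lifecycle; the data stream name here is assumed from earlier in the thread.)

// per-backing-index ILM state, including the current phase and age
GET metrics-otel-default/_ilm/explain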

In my support case I shared all 3 relevant configs used by the in-cluster operator.

So I'm trying to follow the documented and supported versions... The generated values file that is passed into the Helm chart sets the image to docker.elastic.co/beats/elastic-agent:8.16.5, which matches the Elastic cluster version.

Does the otel/opentelemetry-collector-contrib:0.122.1 have the same distribution config as the elastic-agent?

Looks like the 8.16.x Elastic Agent uses v0.112.0 versioned components from the OpenTelemetry distribution.

In 8.17 this changes to v0.119.0.

This is a public forum, so we do not have access to your support case, but good, it sounds like you are working with support...

Might be worth a try...

And of course, as I am sure you know, Elastic Agent in OTel mode is still in technical preview...

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features

We absolutely appreciate you using / trying and reporting issues!

Happy to share here as well. Is it best to just paste it in full, or share a gist/pastebin?

Just to follow up, as I addressed this with the developer of the exporter:

Using otel-collector-contrib:0.122.0 or otel-collector-contrib:0.124.0 I'm now seeing no errors about duplicate metrics. Our hosted Elastic cluster is running 8.16.6.
Ultimately we needed to add a component template using:

PUT _component_template/metrics-otel@custom 
{
  "template": { 
    "mappings": {
      "properties": {
        "_metric_names_hash": { 
          "type": "keyword",
          "time_series_dimension": true 
        } 
      } 
    }
  }
}
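To sanity-check that the template was stored and would actually be composed into this data stream, something like the following should work; the simulate call just shows the mappings a new backing index would get (data stream name taken from the rollover below):

// confirm the component template exists
GET _component_template/metrics-otel@custom

// show the effective template a new backing index with this name would receive
POST _index_template/_simulate_index/metrics-prometheusreceiver.otel-default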

Then roll over the data stream to use a new index (docs don't get routed into the new index for ~30m) with:

POST metrics-prometheusreceiver.otel-default/_rollover

Although the _metric_names_hash is indexed dynamically, I think it'll be missing the time_series_dimension: true on that field.
Also, I think you've helped us discover a bug in releasing this _metric_names_hash workaround.
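One way to double-check what the field looks like on the current write index after the rollover is the field mapping API; it should show whether time_series_dimension made it in (data stream and field name taken from the posts above):

GET metrics-prometheusreceiver.otel-default/_mapping/field/_metric_names_hash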


Nice @Steve_Foster

Thank you ...

Looking at these docs...

Was this the fix?

Yes; however, our hosted cluster is on 8.16.6, so we shouldn't have needed the custom component template based on the wording of the conditions about when you need to add it.

I think this is what foxed Carson in the first instance because it looked like it should have been fine.
