Signal exports fail frequently when more than 20 services export to ELK at the same time

Harshul · August 27, 2024, 10:39am

I'm currently working on a test project that involves multiple services implemented in Python, Java, React, and Node.js. These services utilize OpenTelemetry for observability. The architecture is designed such that:

Agent Collectors: Several services (e.g., Service 1, Service 2, Service 3) are using a common OTLP collector, referred to as an agent collector.
Central Collector: All these agent collectors forward their telemetry data (traces, metrics, logs) to a central OTLP collector.
ELK Export: The central collector then exports this data to Elasticsearch (ELK) for storage and analysis.
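
To make the topology concrete, here is a minimal sketch of the agent-to-central hop. The hostname `central-collector`, the ports, and the plaintext TLS setting are placeholders for illustration rather than our actual values, and only the metrics pipeline is shown (traces and logs would follow the same shape).

```yaml
# Agent collector (illustrative sketch, not the real config):
# receives OTLP from the services and forwards it to the central collector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}                            # group items before forwarding

exporters:
  otlp:
    endpoint: central-collector:4317   # placeholder hostname and port
    tls:
      insecure: true                   # assumption: plaintext inside the cluster

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```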

The Problem:

We are intermittently experiencing the following errors during the metric export process:

error	exporterhelper/queue_sender.go:125	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "otlp/elastic", "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 822}

info	exporterhelper/retry_sender.go:129	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "metrics", "name": "otlp/elastic", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "interval": "1.030954787s"}

Key Details:

Service Implementation: The issue occurs across multiple services using the OTLP protocol.
Common Pattern: Each export attempt fails with a gRPC DeadlineExceeded ("context deadline exceeded") and is retried after an interval. Once the max elapsed time for retries expires, the exporter gives up and drops the data (822 items in the example above).
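
For reference, these are the exporter settings that, as far as I understand, govern the behaviour seen in the logs: the per-request `timeout`, the `retry_on_failure` budget, and the `sending_queue`. This is only a fragment of the central collector's `exporters` section. The exporter name `otlp/elastic` comes from the log lines above; the endpoint and all numeric values are illustrative placeholders, not our real settings and not a tuning recommendation.

```yaml
# Central collector, exporters section only (illustrative sketch).
exporters:
  otlp/elastic:
    endpoint: "elastic-endpoint.example:8200"  # placeholder endpoint
    timeout: 30s              # per-request deadline; exceeding it yields DeadlineExceeded
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # wait before the first retry
      max_interval: 30s       # upper bound on the backoff between retries
      max_elapsed_time: 300s  # total retry budget; after this the data is dropped
    sending_queue:
      enabled: true
      num_consumers: 10       # parallel senders draining the queue
      queue_size: 1000        # requests buffered while the backend is slow
```

If the backend is simply slow under load, a longer `timeout` or retry budget only delays the drop, so it seems worth checking the queue size and capacity metrics on the central collector alongside these values.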

Given that the central collector is responsible for exporting telemetry data to Elasticsearch, I suspect this issue might be related to the ELK stack, particularly in handling large volumes of incoming telemetry data or network latencies.

Request for Assistance:

Root Cause: Could this issue be related to Elasticsearch's ability to handle telemetry data exported from the central collector? If so, what might be the underlying cause?
Potential Fixes: Are there any specific configurations or optimizations on the Elasticsearch side that we should consider to avoid these errors?
Further Diagnosis: Any recommendations on how to further diagnose or mitigate this issue would be greatly appreciated.

Thank you for your assistance!

Harshul · October 16, 2024, 7:19am

Still waiting for a response on this!
