Signal exports fail frequently when more than 20 services export to ELK at the same time

I'm working on a test project with multiple services implemented in Python, Java, React, and Node.js. These services use OpenTelemetry for observability. The architecture is as follows:

  1. Agent Collectors: Several services (e.g., Service 1, Service 2, Service 3) share a common OpenTelemetry Collector, referred to as an agent collector.
  2. Central Collector: All agent collectors forward their telemetry data (traces, metrics, logs) over OTLP to a central OpenTelemetry Collector.
  3. ELK Export: The central collector then exports this data to Elasticsearch (ELK) for storage and analysis.

The Problem:

We are intermittently experiencing the following errors during the metric export process:

error	exporterhelper/queue_sender.go:125	Exporting failed. No more retries left. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "otlp/elastic", "error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded", "dropped_items": 822}

info	exporterhelper/retry_sender.go:129	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "metrics", "name": "otlp/elastic", "error": "rpc error: code = DeadlineExceeded desc = context deadline exceeded", "interval": "1.030954787s"}
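
To get a sense of how much data is being lost, a rough sketch like the following could tally the dropped items and retry attempts per exporter. It assumes the collector log lines keep the layout shown above (level, caller, message, then a JSON object); the function name `summarize_export_failures` is made up for illustration.

```python
import json
import re
from collections import Counter

# Assumed layout, taken from the log lines above:
# level, caller, message, then a trailing JSON object with the fields.
LINE_RE = re.compile(
    r"^(?P<level>\w+)\s+(?P<caller>\S+)\s+(?P<msg>.*?)\s+(?P<fields>\{.*\})\s*$"
)


def summarize_export_failures(lines):
    """Tally dropped items and retry attempts per exporter name."""
    dropped = Counter()
    retries = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not in the expected layout, skip it
        try:
            fields = json.loads(match.group("fields"))
        except json.JSONDecodeError:
            continue  # trailing braces were not valid JSON
        name = fields.get("name", "unknown")
        if "dropped_items" in fields:
            # "No more retries left. Dropping data." lines
            dropped[name] += int(fields["dropped_items"])
        elif "interval" in fields:
            # "Will retry the request after interval." lines
            retries[name] += 1
    return dropped, retries
```

Running it over the central collector's log (for example, iterating over an open log file) should show whether the drops are confined to `otlp/elastic` and how they grow as more services come online.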

Key Details:

  • Service Implementation: The issue occurs across multiple services, all of which export over OTLP.
  • Common Pattern: Each export attempt from the `otlp/elastic` exporter hits its gRPC deadline (DeadlineExceeded). The exporter retries until its maximum elapsed time for retries expires, and then drops the batch (822 metric items in the example above).
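
To reason about how long a batch survives before it is dropped, here is a simplified model of the retry budget. It is only a sketch: the parameter names mirror the collector's `retry_on_failure` and `timeout` exporter settings, the numbers in the usage note are hypothetical rather than our actual configuration, jitter is ignored, and every attempt is assumed to run into the full timeout.

```python
def retries_before_drop(initial_interval, multiplier, max_interval,
                        max_elapsed_time, attempt_timeout):
    """Estimate how many retries fit into the retry budget.

    Simplified model (an assumption, not the collector's exact logic):
    every attempt costs `attempt_timeout` seconds, the wait between
    attempts grows by `multiplier` up to `max_interval`, and jitter
    is ignored.
    Returns (number_of_retries, seconds_spent_before_giving_up).
    """
    elapsed = attempt_timeout  # the first attempt times out
    interval = initial_interval
    retries = 0
    # Retry only while the next wait plus the next attempt still fit.
    while elapsed + interval + attempt_timeout <= max_elapsed_time:
        elapsed += interval + attempt_timeout
        retries += 1
        interval = min(interval * multiplier, max_interval)
    return retries, elapsed
```

With hypothetical values of a 1 s initial interval, a multiplier of 2, an 8 s maximum interval, a 30 s budget, and a 5 s timeout, `retries_before_drop(1, 2, 8, 30, 5)` returns `(3, 27)`: only three retries fit before the data is dropped. If the backend stays slow for longer than that window, every queued batch ends the same way.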

Given that the central collector is responsible for exporting telemetry data to Elasticsearch, I suspect the issue is related to the ELK stack, in particular to how it handles large volumes of incoming telemetry data, or to network latency between the collector and ELK.

Request for Assistance:

  • Root Cause: Could this issue be related to Elasticsearch's ability to handle telemetry data exported from the central collector? If so, what might be the underlying cause?
  • Potential Fixes: Are there any specific configurations or optimizations on the Elasticsearch side that we should consider to avoid these errors?
  • Further Diagnosis: Any recommendations on how to further diagnose or mitigate this issue would be greatly appreciated.

Thank you for your assistance!