Jaeger dropping spans with Elasticsearch backend in Kubernetes

Hello,

I have installed Jaeger with an Elasticsearch storage backend in Kubernetes (AWS EKS). The cluster has 8 r5.4xlarge nodes and runs Elasticsearch 7.16.2 (docker.elastic.co/elasticsearch/elasticsearch:7.16.2), deployed with the official Elasticsearch Helm chart (elastic/helm-charts on GitHub). The values I have overridden are:

  elasticsearch:
    replicas: 4
    minimumMasterNodes: 3
    volumeClaimTemplate:
      accessModes: ['ReadWriteOnce']
      resources:
        requests:
          storage: 2000Gi
    resources:
      requests:
        cpu: '5000m'
        memory: '32Gi'
      limits:
        cpu: '8000m'
        memory: '32Gi'
    esJavaOpts: '-Xmx16g -Xms16g'
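
For completeness, the Jaeger collectors are pointed at this cluster through the chart's default service, roughly like the snippet below (this is a simplified sketch, not our exact manifest — the image tag, names, and namespaces are illustrative):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: jaeger-collector
    namespace: jaeger
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: jaeger-collector
    template:
      metadata:
        labels:
          app: jaeger-collector
      spec:
        containers:
          - name: jaeger-collector
            image: jaegertracing/jaeger-collector:1.30.0   # version is illustrative
            ports:
              - containerPort: 14250   # gRPC span ingestion
            env:
              - name: SPAN_STORAGE_TYPE
                value: elasticsearch
              - name: ES_SERVER_URLS
                # elasticsearch-master is the default service name created by the chart;
                # the namespace here is an assumption
                value: http://elasticsearch-master.elasticsearch.svc.cluster.local:9200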

I have been seeing periodic span drops that correlate with spikes in the collector's in-queue and save latency. Save latency in particular suggests the storage backend isn't able to keep up. The spikes also line up with an increase in flush operation time, and the index writer memory climbs steadily until the flush, at which point it drops and we see the dropped spans and latency spikes. I don't have much experience with Elasticsearch, so any help or guidance is appreciated!
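
In case it helps, this is roughly how we alert on those symptoms from the collector's Prometheus metrics (the metric names are from memory and may differ between Jaeger versions, so treat them as an assumption and double-check against the collector's /metrics endpoint):

  groups:
    - name: jaeger-collector
      rules:
        - alert: JaegerCollectorDroppingSpans
          expr: sum(rate(jaeger_collector_spans_dropped_total[5m])) > 0
          for: 5m
          annotations:
            summary: Jaeger collector is dropping spans
        - alert: JaegerSaveLatencyHigh
          # p95 time to write a span to storage; sustained growth usually means
          # the backend (Elasticsearch here) is falling behind
          expr: histogram_quantile(0.95, sum(rate(jaeger_collector_save_latency_bucket[5m])) by (le)) > 1
          for: 10m
          annotations:
            summary: p95 span save latency above 1s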

I have attached some screenshots of our Jaeger and Elasticsearch dashboards.


7.1 is EOL and no longer supported. Please upgrade ASAP.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

The issue was not Elasticsearch, but rather the way we were routing traffic from our OpenTelemetry collector fleet to our Jaeger collector fleet. We were routing through the Kubernetes Service DNS name, jaeger-collector.jaeger.svc.cluster.local:14250, and noticed that traces were not distributed evenly across the Jaeger collectors. The root cause was gRPC's behaviour here: it keeps long-lived HTTP/2 connections, and a ClusterIP Service only load-balances when a connection is established, so each OpenTelemetry collector ended up pinned to a single Jaeger collector pod. We swapped in an AWS Application Load Balancer with a target group using the gRPC protocol version, which balances individual gRPC requests across pods. Since the swap we are no longer dropping spans, translog ops and size are tiny, and the index write memory is steady and flat.
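
For anyone hitting the same thing, the end state looks roughly like the Ingress sketch below, assuming the ALB is provisioned by the AWS Load Balancer Controller (this isn't our exact manifest — the names, namespace, and certificate ARN are placeholders, and an ALB gRPC target group requires an HTTPS listener):

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: jaeger-collector-grpc
    namespace: jaeger
    annotations:
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip                 # register pod IPs so the ALB balances across pods
      alb.ingress.kubernetes.io/backend-protocol-version: GRPC  # gRPC target group protocol version
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...  # placeholder ACM certificate
  spec:
    ingressClassName: alb
    rules:
      - http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: jaeger-collector
                  port:
                    number: 14250

The OpenTelemetry collectors then export to the ALB's DNS name instead of jaeger-collector.jaeger.svc.cluster.local, and because the ALB balances at the HTTP/2 request level, individual gRPC calls spread across the collector pods.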

