Distributed tracing missing expansion for one service


Kibana version: 7.3.2

Elasticsearch version: 7.3.2

APM Server version: 7.3.2

APM Agent language and version: Ruby agent 2.11.0 (Rails 5.3)

Browser version: N/A

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Elastic Cloud

Fresh install or upgraded from other version? Fresh Install

Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs? Are you using a load balancer in front of the APM Servers? Have you changed index pattern, generated custom templates, changed agent configuration etc.

N/A

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

We're trialing the APM solution and it's all working pretty well. However, there's an odd behaviour with distributed tracing that I'm struggling to understand, so I'll try to explain it. First, the setup:

Three services: a webapp, a "pod" API and an "edge" API. All are Ruby on Rails applications deployed on AWS Elastic Beanstalk, and all have the latest Elastic APM gem installed.

When viewing a trace that involves a call to the edge API, it's as if the APM agent doesn't see that the call is part of the larger trace. For example:

In the example above, the trace originates in the webapp. The webapp then calls into the "pod API" (pd-sh-cb-p01-api), which is successfully "matched", and the method from the pod API is shown. However, the next call, into the "edge API" (GET api), just shows as a regular HTTP GET with no further detail. That call actually hits a controller #show method in the edge API, and I've verified that when I look at the edge API in APM, that action is listed as a transaction.

Here's what I see for the call to the pod API (GET pd-sh-cb-p01 in the above shot):

But the call to our "edge API" (GET api..) only shows

The call is made with net-http, which is a supported library, so I'm at a loss to explain why some calls are being distributed-traced and others are not. It only seems to affect calls to (what we call) this edge API, so is there some extra config I need to enable there that wouldn't be needed on the other agents? I've spent a few hours looking for differences between the services but can't find any.

Help appreciated!

Cheers

Hi Dave!

This looks mysterious, you're right.

Distributed Tracing works by passing along a header called Elastic-Apm-Traceparent with the relevant trace IDs. Perhaps check whether it is present across all the services?
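If it isn't obvious from the outgoing requests, a quick way to check is a throwaway before_action in each service that logs the incoming value (just a sketch; the filter name here is made up):

    # app/controllers/application_controller.rb
    class ApplicationController < ActionController::Base
      before_action :log_traceparent

      private

      # Log the incoming trace context so you can confirm it survives the hop
      # from the calling service into this one.
      def log_traceparent
        traceparent = request.headers['Elastic-Apm-Traceparent']
        Rails.logger.info("Elastic-Apm-Traceparent: #{traceparent.inspect}")
      end
    end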

I'll look into this early next week and see if I can figure something out or reproduce it.

Mikkel, Elastic

Mikkel,
I posted earlier that the header wasn't set, but I was mistaken. What I've found is that in our staging environment things are working! The big difference is that the volume of data in staging is super low (around 1 request/min vs 50,000/min in prod).

Looking in the agent log on the "edge api" service I'm seeing:

Queue is full (256 items), skipping

so I'm guessing the data is simply never making it to the APM Server. I also know that the APM Server is under-sized while we're on the trial.

Reading this: https://www.elastic.co/guide/en/apm/agent/ruby/current/debugging.html

it looks like there are a few suggestions. Which one have you seen work best while still keeping distributed tracing intact? I'd imagine that setting transaction_sample_rate to a lower value would cause traces to be missed, and we'd have the same problem?
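For reference, here's roughly what I'm thinking of trying based on that page (just a sketch: the option names are what I see in the agent's configuration docs, the Rails-style config block is my assumption, and the values are guesses to experiment with):

    # config/application.rb -- the same keys could also go in config/elastic_apm.yml
    module EdgeApi # hypothetical app module name
      class Application < Rails::Application
        # Larger in-memory queue -- I believe this is the buffer the
        # "Queue is full (256 items), skipping" message refers to (default 256).
        config.elastic_apm.api_buffer_size = 1024
        # Capture fewer spans per transaction so each request produces less data.
        config.elastic_apm.transaction_max_spans = 200
      end
    end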

Regardless, given that it works in staging, I think this is the root cause....can't view a span that isn't there :slight_smile:

Looking forward to any recommendations you may have.

Cheers

Dave

Ah, it looks like the server can't handle the volume either. In the client logs, as well as

Queue is full (256 items), skipping

I'm also seeing

APM Server responded with an error: "{\"accepted\":30,\"errors\":[{\"message\":\"queue is full\"}]}\n"

So it looks like this is just a pure volume thing. I'll play with some of the client settings (can't change the server size during the trial, alas), but I suspect there won't be much I can do without a bigger server.

Good to hear you found the cause!

Sampling is a good place to start, as you can quickly scale down the payload size and count and then scale it up little by little if needed.

Distributed Tracing will still work, as the agent sets the header even when a transaction isn't sampled. The parent will just be the topmost transaction instead of a span.
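As a rough example, a low starting rate could look like this (0.1 is only a starting point to tune from, and this assumes the Rails-style config block; the same key can also live in config/elastic_apm.yml):

    # config/application.rb -- illustrative value only
    module EdgeApi # hypothetical app module name
      class Application < Rails::Application
        # Keep full span detail for roughly 10% of transactions. Unsampled
        # transactions are still recorded and still propagate the trace header,
        # so distributed traces stay connected end to end.
        config.elastic_apm.transaction_sample_rate = 0.1
      end
    end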


This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.