Sampling rate handling for distributed traces

Hi, we'd like to better understand how transaction_sample_rate is handled and interpreted across a distributed system. Does the value of transaction_sample_rate only affect sampling at the entry-point service, with all succeeding services being sampled because a trace.id is propagated to them, even if those succeeding services have a sampling rate lower than 1.0?

To illustrate, imagine services A and B with the following sampling rates:

A=1.00, B=1.00 : request --> A --> B : sampled 100%
A=0.50, B=1.00 : request --> A --> B : sampled 50%
A=1.00, B=0.50 : request --> A --> B : sampled 100% or 50%?
A=0.50, B=0.50 : request --> A --> B : sampled 50% or 25%?

We are interested in dynamically changing our sampling rates in response to increasing or decreasing traffic, as well as anomalies detected in errors or latencies. As a result, we are wondering whether we should change the sampling rate on all services along a trace path, or only on the entry-point services. This matters because we have services that act as both entry-point and succeeding services.
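To give a rough idea of the kind of adjustment we have in mind, here is a minimal sketch in plain Python. It is not an agent API; the thresholds and the `choose_sample_rate` function are just placeholders for whatever policy we end up with, and how the chosen value reaches transaction_sample_rate is left open (environment variable, redeploy, etc.).

```python
# Illustrative sketch only: pick a transaction_sample_rate for a service based
# on current traffic and error signals. The thresholds are placeholders, and
# how the chosen rate is pushed to the agent is left open.
def choose_sample_rate(requests_per_second: float, error_rate: float) -> float:
    if error_rate > 0.05:            # errors spiking: capture everything
        return 1.0
    if requests_per_second > 1000:   # heavy traffic: shed volume
        return 0.1
    return 0.5                       # steady state

print(choose_sample_rate(200, 0.01))    # 0.5
print(choose_sample_rate(5000, 0.01))   # 0.1
print(choose_sample_rate(200, 0.10))    # 1.0
```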

TIA!

P.S.

AppDynamics has an option to set the sampling rate to 100% for x seconds/minutes from the backend for individual business transactions (the equivalent of distributed traces in Elastic APM), to help capture requests during production diagnostics. We would love to see something similar in Elastic.

The first service determines whether the whole trace should be sampled or not. So in your examples, the sampling rate is always determined by service A (100% in your third scenario, 50% in your fourth). Only if B were invoked directly would its own configured sampling rate be taken into account.
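To illustrate the mechanics (this is a sketch of the principle, not the actual agent code): the decision made at the entry service travels downstream in the trace context header, which uses the W3C traceparent format `00-<trace-id>-<parent-id>-<flags>`, where the 0x01 bit of the flags byte means "sampled". A service that receives such a header simply follows that flag instead of rolling against its own transaction_sample_rate.

```python
# Sketch of head-based sampling propagation; not actual Elastic APM agent code.
import random

def decide_at_entry(sample_rate: float) -> str:
    """Service A: no incoming trace context, so roll against its own rate."""
    sampled = random.random() < sample_rate
    trace_id = "%032x" % random.getrandbits(128)
    span_id = "%016x" % random.getrandbits(64)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def decide_downstream(traceparent: str) -> bool:
    """Service B: inherit the decision from the propagated header."""
    return traceparent.rsplit("-", 1)[-1] == "01"

header = decide_at_entry(0.5)       # A=0.50
print(header)
print(decide_downstream(header))    # B follows A, whatever B's own rate is
```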

You may be pleased to hear we're planning to add remote configuration capabilities to the agents.

Thanks for the input! We're also evaluating other sampling strategies, such as rate-limited sampling and tail-based (after-the-fact) sampling.
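As an example of the first idea, rate-limited sampling would cap the number of sampled traces per unit of time instead of using a fixed probability. A conceptual sketch of that idea (not an Elastic APM feature or API):

```python
import time

class RateLimitedSampler:
    """Sample at most max_per_second traces in any one-second window."""

    def __init__(self, max_per_second: int):
        self.max_per_second = max_per_second
        self.window_start = time.monotonic()
        self.count = 0

    def should_sample(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new one-second window
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_second:
            self.count += 1
            return True
        return False

sampler = RateLimitedSampler(max_per_second=100)
print(sampler.should_sample())  # True until 100 traces are sampled this second
```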

Thank you very much for confirming the scenarios, Felix; it's super appreciated.

We are also looking forward to the agents' remote configuration capabilities. That would really help our ability to dynamically address latencies and exceptions.

Cheers!
