Compound instrumentation combinations

Hi again,

We've managed to use sampling rates, trace method duration, and span frames min duration to manage our data ingest volume. However, we are trying to come up with a way to also force instrumentation and tracing for spans (and transactions, though I suspect transactions are computed from spans) that exceed a duration threshold, regardless of the sampling rate. The reason is that we want to be sure to capture all excessively long transactions, which may sit on the extreme end of the long tail, no matter how aggressive our sampling is.
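For concreteness, the knobs we're using look roughly like this (a sketch assuming the Elastic APM Java agent and its programmatic attacher; the option names are real, but the values and package names are illustrative):

```java
import co.elastic.apm.attach.ElasticApmAttacher;

import java.util.Map;

public class ApmBootstrap {
    public static void main(String[] args) {
        // Attach the agent programmatically with the volume-control settings.
        ElasticApmAttacher.attach(Map.of(
            "transaction_sample_rate", "0.01",            // head-based sampling: keep ~1% of traces
            "trace_methods", "com.example.service.*",     // hypothetical packages to instrument
            "trace_methods_duration_threshold", "250ms",  // skip traced methods faster than this
            "span_frames_min_duration", "50ms"            // stack traces only for slow spans
        ));
        // ... start the application as usual
    }
}
```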

Any advice?

Thanks!

That threshold would be on the execution time?

That's quite a hard problem for distributed tracing, because you normally want a trace to be consistent in the sampling decision. Meaning, if one service samples a transaction, you want all the other services to also sample their transactions belonging to the same trace. Otherwise, you'd have gaps in the trace, which may be confusing or misleading.
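To make that concrete: with W3C trace context, which the agents propagate between services, the decision travels in the `traceparent` header, and downstream services honor it instead of re-deciding. A minimal sketch of reading that flag:

```java
// The traceparent header has the form version-traceid-parentid-flags, e.g.
// 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
// Bit 0 of the flags byte is the "sampled" flag.
static boolean isUpstreamSampled(String traceparent) {
    String[] parts = traceparent.split("-");
    int flags = Integer.parseInt(parts[3], 16);
    return (flags & 0x01) == 0x01;
}
```

Forcing a local decision because a span turned out slow would flip that flag only from that service onwards, which is exactly how the gaps arise.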

So one way of tackling this is to sample a large proportion of traces, or even all of them, and to quickly discard those that are not interesting. That's also called tail-based sampling.

Thanks Felix, I suspected the same.

I already considered the tail-based sampling and would probably have to go that way eventually.

To provide context: our setup is a bit unusual, with two (more in the future) data centers, each having its own APM Server instances. Metrics and traces are collected there but pushed to a third data center, where we have the Elasticsearch cluster, as that is where monitoring and alerting are located, allowing a single view of all our services. The amount of data coming from the two DCs is just immense, and we managed to tame it via aggressive sampling (less than 1%). However, this sampling rate makes it hard for us to collect enough long-tail transactions.

PS: Yes, the threshold would be based on execution time.

One option for tail-based sampling is to send the spans and transactions from the APM Server to Kafka. Then you can use Kafka's stream processing capabilities to group by the trace ID and determine whether to discard the trace or forward it to ES. I have never tried that out, though.
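Roughly, it could look like this with Kafka Streams (an untested sketch; it assumes the events arrive as JSON on an `apm-events` topic keyed by trace ID, and `isInteresting` stands in for the actual threshold check):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.SessionWindows;

import java.time.Duration;
import java.util.Properties;

public class TailSampler {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // APM events (transactions and spans) as JSON, keyed by trace.id.
        KStream<String, String> events = builder.stream("apm-events");

        events
            // Group all events that belong to the same trace.
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            // Consider a trace complete after 30s without new events for it.
            .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofSeconds(30)))
            // Concatenate the trace's events so they are kept or dropped as a unit.
            .reduce((aggregate, event) -> aggregate + "\n" + event)
            .toStream()
            // Session merges emit null tombstones; drop those, then apply the
            // actual tail-sampling decision.
            .filter((windowedTraceId, trace) -> trace != null && isInteresting(trace))
            .map((windowedTraceId, trace) -> KeyValue.pair(windowedTraceId.key(), trace))
            // Everything on this topic would then be indexed into Elasticsearch.
            .to("apm-sampled");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "apm-tail-sampler");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder decision: a real version would parse the events and check
    // whether any transaction exceeds the duration threshold.
    static boolean isInteresting(String traceEvents) {
        return !traceEvents.isEmpty();
    }
}
```

Note that without suppression, the session window may forward intermediate results before the trace is complete, so a real implementation would want `Suppressed.untilWindowCloses` plus proper JSON serdes.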

Thanks for that tip, Felix.
I will have a look at that. I'm wondering if something similar can be done via Logstash to avoid introducing another tech stack; maybe something like a sampling filter in Logstash.

One thing I was looking at is the sampling mechanism itself. Is it possible to have some sort of sampling tag that can be processed server-side, instead of discarding the spans on the agent side? Correct me if I'm wrong, but even with sampling, agents still record transaction instances to come up with accurate TPM numbers, so I was wondering if this mechanism can be leveraged somehow.
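For example, something along these lines with the Java agent's public API (just a sketch; `currentTransaction()` and `setLabel()` are real API methods, but as far as I understand, spans of unsampled transactions are never captured in the first place, so this would only flag the transaction document itself):

```java
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Transaction;

public class OutlierTagging {

    static void handleRequest() {
        long start = System.nanoTime();
        doWork(); // hypothetical unit of work
        long tookMs = (System.nanoTime() - start) / 1_000_000;

        // Tag outliers so that a server-side processor could keep anything
        // carrying this label, even when transaction.sampled is false.
        Transaction tx = ElasticApm.currentTransaction();
        if (tookMs > 5_000) {
            tx.setLabel("slow_outlier", true);
        }
    }

    static void doWork() { /* ... */ }
}
```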
