Compound instrumentation combinations

digitalron · July 18, 2019, 1:30pm

Hi again,

We've managed to use sampling rates, trace method duration, and span frames min duration to manage our data ingress volume. However, we are trying to come up with a way to also force instrumentation and tracing for spans (and transactions, but I suspect transactions are computed from spans) that exceed a threshold, regardless of the sampling rate. The reason is that we want to be sure to capture all excessively long transactions, which may sit at the extreme end of the long tail, regardless of the transaction rate.

Any advice?

Thanks!

felixbarny · July 18, 2019, 2:59pm

Would that threshold be on the execution time?

That's quite a hard problem for distributed tracing because you normally want the tracing decision to be consistent across a trace. That means if one service samples a transaction, you want all the other services to also sample their transactions that belong to the same trace. Otherwise, you'd have gaps in the trace, which may be confusing or misleading.
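
To illustrate the consistency point, here is a minimal sketch, assuming the sampling decision travels between services in a W3C-style `traceparent` header (version, trace ID, parent ID, flags). The helper names `parse_sampled` and `decide_sampling` are hypothetical and not part of any Elastic APM agent API.

```python
import random
from typing import Optional

SAMPLED_FLAG = 0x01  # lowest bit of the trace-flags field in a W3C traceparent header


def parse_sampled(traceparent: str) -> bool:
    """Return the upstream sampling decision carried in a traceparent header."""
    _version, _trace_id, _parent_id, flags = traceparent.split("-")
    return bool(int(flags, 16) & SAMPLED_FLAG)


def decide_sampling(traceparent: Optional[str], sample_rate: float) -> bool:
    """Hypothetical helper: only the service that starts a trace rolls the dice."""
    if traceparent is None:
        # No incoming header: this service is the root and makes the decision.
        return random.random() < sample_rate
    # Otherwise reuse the upstream decision so the trace has no gaps.
    return parse_sampled(traceparent)
```

Because downstream services only copy the decision, a service that notices a slow transaction afterwards cannot retroactively make the rest of the trace sampled, which is why a duration threshold is hard to apply at the agent.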

So one way of tackling this is to sample a large proportion of traces, or even all of them, and to quickly discard those that are not interesting. That's called tail-based sampling.
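
As a rough sketch of what "not interesting" could mean for a duration threshold: keep every trace that contains a slow event, plus a small random share of the rest. The function name `keep_trace` and the `duration_ms` field are assumptions made for illustration, not the APM Server document schema.

```python
import random
from typing import Callable, Mapping, Sequence


def keep_trace(
    events: Sequence[Mapping],
    threshold_ms: float,
    baseline_rate: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide after the fact whether a complete trace is worth storing."""
    # Always keep traces that contain at least one slow transaction or span.
    if any(e["duration_ms"] >= threshold_ms for e in events):
        return True
    # Keep a small random share of the remaining traces as a baseline.
    return rng() < baseline_rate
```

With `baseline_rate` set to something like 0.01, the stored volume stays close to a 1% sample while the long tail is kept in full.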

digitalron · July 18, 2019, 6:04pm

Thanks Felix, I suspected the same.

I had already considered tail-based sampling and will probably have to go that way eventually.

To provide context, our setup is a bit unusual: two data centers (more in the future) each have their own APM Server instances. Metrics and traces are collected there but pushed to a third data center that hosts the Elasticsearch cluster, since that is where monitoring and alerting are located, to allow a single view of all our services. The amount of data coming from the two DCs is just immense, and we managed to tame it via aggressive sampling (less than 1%). However, this sampling rate makes it hard for us to collect enough long-tail transactions.

PS: Yes, the threshold is based on execution time.

felixbarny · July 19, 2019, 7:44am

One option for tail-based sampling is to send the spans and transactions from the APM Server to Kafka. Then you can use Kafka's stream processing capabilities to group by trace ID and determine whether to discard the trace or forward it to ES. I have never tried that out, though.
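
The grouping step could look roughly like the sketch below. It is plain Python with no Kafka client, so it only illustrates the logic a stream processor would run. The `TraceBuffer` class, the `trace_id` and `duration_ms` fields, and the idle timeout used to guess that a trace is complete are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TraceBuffer:
    """Group events by trace ID and decide per trace once it goes quiet."""

    def __init__(self, idle_s: float, decide: Callable[[List[dict]], bool]):
        self.idle_s = idle_s    # seconds without new events before a trace counts as done
        self.decide = decide    # returns True if the whole trace should be forwarded
        self.events: Dict[str, List[dict]] = defaultdict(list)
        self.last_seen: Dict[str, float] = {}

    def add(self, event: dict, now: float) -> None:
        """Buffer one span or transaction under its trace ID."""
        trace_id = event["trace_id"]
        self.events[trace_id].append(event)
        self.last_seen[trace_id] = now

    def flush(self, now: float) -> List[dict]:
        """Return the events of finished traces that should be forwarded."""
        forwarded: List[dict] = []
        done = [t for t, seen in self.last_seen.items() if now - seen >= self.idle_s]
        for trace_id in done:
            trace = self.events.pop(trace_id)
            del self.last_seen[trace_id]
            if self.decide(trace):
                forwarded.extend(trace)
            # Traces that fail the check are dropped here and never reach storage.
        return forwarded
```

A buffer created with `TraceBuffer(idle_s=30, decide=lambda t: max(e["duration_ms"] for e in t) >= 1000)` would forward only traces that contain an event of at least one second. The trade-off is memory: every event has to be held until its trace goes idle, so the buffer size grows with throughput and with the chosen timeout.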

digitalron · July 19, 2019, 10:54am

Thanks for that tip, Felix.
I will have a look at that. I'm wondering if something similar can be done via Logstash to avoid having another tech stack; maybe something like a sampling filter in Logstash.

One thing I was looking at is the sampling mechanism. Is it possible to have some sort of sampling tag that can be processed server side, instead of discarding the spans at the agent side? Correct me if I'm wrong, but even with sampling, agents still record transaction instances to come up with accurate TPM numbers, so I was wondering if this mechanism can be leveraged somehow.

system · August 9, 2019, 6:54am

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.
