Aggregated Distributed Tracing View?

We are evaluating whether to use Elastic APM for distributed tracing in our company and have the following question (based on the public demo at https://demo.elastic.co/app/apm).

Is it possible to produce an aggregated distributed tracing view?

As far as we can see there is the ability to view aggregated transaction data (i.e. "Transaction duration" and "Requests per minute") but only the ability to see one random "Trace sample" in the "Timeline" view.

What would be most useful is to see an aggregate Timeline view showing the average durations across all the distributed traces for a given time period. This would show us the bottleneck in our application stack which a random "Trace sample" may not. If there is another way to do this, please let me know.

@dgrogan welcome to the forum!

TL;DR: it's technically possible to do by post-processing the data, and I think this would be very useful, but it's not straightforward. Input welcome.

This is something I've been thinking about for a little while, but I haven't yet come up with a concrete proposal. I think it would be useful; it's mainly a matter of coming up with an approach that won't be too expensive.

There is an ongoing effort to add CPU and heap profiling to the product, which we would visualise as a FlameGraph. I think it might be possible (and useful) to also present FlameGraphs of aggregated distributed traces.

The main question is how to do that efficiently. For CPU/heap profiling, we can hash the function names that make up a unique stack trace, in order to aggregate them; this isn't hard because stack traces are not distributed, so we can process them as a unit.
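To make the stack-trace case concrete, here is a rough sketch in Python. It is only an illustration of the hashing idea, not the product's implementation; `stack_key` and `aggregate_stacks` are made-up names, and each sample is assumed to be a plain list of function names ordered from root to leaf.

```python
import hashlib
from collections import Counter


def stack_key(frames):
    """Hash the ordered function names of one stack trace into a stable key."""
    h = hashlib.sha256()
    for name in frames:
        h.update(name.encode("utf-8"))
        # Separator byte so that ["ab", "c"] and ["a", "bc"] hash differently.
        h.update(b"\x00")
    return h.hexdigest()


def aggregate_stacks(samples):
    """Count how many samples share each unique stack (hypothetical helper)."""
    counts = Counter()
    frames_by_key = {}
    for frames in samples:
        key = stack_key(frames)
        counts[key] += 1
        frames_by_key[key] = tuple(frames)
    return counts, frames_by_key
```

Because each sample arrives as a complete unit, the key can be computed locally with no coordination between services, which is exactly what distributed traces lack.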

We can think of distributed traces as a collection of stack traces with associated wall-clock timings. To produce a FlameGraph, we would need to first query all transactions for the given time period/filters, then find each transaction's complete trace, and then for each path in each trace, aggregate the wall-clock timings for those paths (based on transaction/span names) incrementally.

e.g. given Rack -> GET opbeans-python -> GET opbeans.view.customers -> GET opbeans-golang, we would create aggregates for:

  • Rack
  • Rack -> GET opbeans-python
  • Rack -> GET opbeans-python -> GET opbeans.view.customers
  • Rack -> GET opbeans-python -> GET opbeans.view.customers -> GET opbeans-golang
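As a rough sketch of that aggregation step in Python: the event shape used here (dicts with `id`, `parent_id`, `name` and `duration_ms`) is an assumption for illustration, not the actual APM document schema, and `aggregate_trace_paths` is a made-up name. Each trace is assumed to have already been fetched in full, which is the expensive part.

```python
from collections import defaultdict


def aggregate_trace_paths(traces):
    """Average wall-clock duration per named path across many traces.

    `traces` is an iterable of traces; each trace is a list of event dicts
    with hypothetical keys: id, parent_id (None for the root), name,
    duration_ms.
    """
    totals = defaultdict(lambda: [0, 0.0])  # path -> [count, total duration]
    for events in traces:
        by_id = {e["id"]: e for e in events}
        for event in events:
            # Walk up the parent chain to build the path of names.
            # If a parent is missing, the path simply starts at the orphan.
            names = []
            node = event
            while node is not None:
                names.append(node["name"])
                node = by_id.get(node["parent_id"])
            path = tuple(reversed(names))
            totals[path][0] += 1
            totals[path][1] += event["duration_ms"]
    return {
        path: {"count": count, "avg_ms": total / count}
        for path, (count, total) in totals.items()
    }
```

Every event contributes to exactly one path (the chain from the root down to itself), so a trace of four nested events yields four aggregates, one per path prefix.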

If transaction names were known up front, this wouldn't be too hard; we could propagate a hash of the path with distributed trace context. Unfortunately, transaction names are not always known up front. Therefore, we would either have to do this on demand, or as a batch or stream processing job.
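For illustration, the propagated-hash idea could look roughly like this in Python. To be clear, this is hypothetical: no such path-hash field exists in the trace context that agents propagate today, and `extend_path_hash` is a made-up helper.

```python
import hashlib


def extend_path_hash(parent_hash, name):
    """Derive a child's path hash from its parent's hash and its own name.

    Hypothetical helper: `parent_hash` is the value received from upstream
    ("" for the root of the trace), and `name` is the transaction or span
    name, which must already be final when the outgoing request is made.
    """
    h = hashlib.sha256()
    h.update(parent_hash.encode("ascii"))
    h.update(b"\x00")  # separator between the parent hash and the name
    h.update(name.encode("utf-8"))
    return h.hexdigest()[:16]  # truncated to keep the propagated value short


# Each hop extends the hash it received from upstream.
rack = extend_path_hash("", "Rack")
python_svc = extend_path_hash(rack, "GET opbeans-python")
view = extend_path_hash(python_svc, "GET opbeans.view.customers")
```

Identical paths always produce identical hashes, so aggregation would reduce to a simple group-by on that value. The catch is the one described above: a service that only learns its transaction name after routing cannot compute its hash before it calls downstream.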

Hi @dgrogan! A quick side note. You probably noticed this -- and if you did, my comment won't be relevant:

The randomly sampled transaction you see in your screenshot is based on your selection in the Transactions duration distribution. In your screenshot, you have the fastest bucket selected, which probably won't produce a useful trace sample for finding bottlenecks. If you select one of the slower requests on the right, you'll see a random trace sample from within that bucket (time range). So, still not an aggregate timeline view, but hopefully useful for finding that bottleneck.

Anyway, back to your conversation with Andrew. Just wanted to share that information in case it was missed.