APM does not display spans despite having them in ES

I have two Django apps and celery running tasks in background, and Kibana seems to not display properly spans for transaction that are stored in ES.


Transaction metadata shows transaction.span_count.started: 48 and indeed when I search for this transaction.id in Discover I see 49 documents

In this particular case, I had a request to service A, that send a request to service B, which scheduled about 1500 celery tasks (celery task transaction shown above). When I click on View full trace it shows the transaction of service A, but spans are not displayed

Transaction metadata shows transaction.span_count.started: 5 which means it only counts spans from service A, that makes sense. It does not however display transaction and spans from service B, which is stored in ES. I need to find the transaction.id from Discover view and filter service B transactions in APM only to see the same thing

This time transaction.span_count.started: 100 and transaction.span_count.dropped: 1557 as I have a 100 spans per transaction limit, everything as expected. Everything except for displaying those spans correctly in APM UI

Discover says I have 109k documents with the same trace.id, thats a bit more than I would expect but still, the data is there, it's just not displayed properly. It does seem to properly aggregate the span data tho, as Time spent by span type propery displays different span types that were created in this trace (I'd love to add a screen here but I can only add one media in the post)

This however does not happen regularly. In like 99% of cases when I browse APM logs for regular HTTP requests everything works fine. In like 10% of cases when I browse APM logs for celery tasks, this happens. I can easily find other examples when similar trace of Service A -> Service B -> Celery task renders correctly

Kibana version:
7.17.7

Elasticsearch version:
7.17.7

APM Server version:
7.17.7

APM Agent language and version:
Python 6.13.2

Browser version:
Chrome 119.0.6045.159

Original install method (e.g. download page, yum, deb, from source, etc.) and version:
n/a

Fresh install or upgraded from other version?
Fresh

Is there anything special in your setup?
n/a

Steps to reproduce:
I couldnt reliably reproduce the issue

Errors in browser console (if relevant):

GET https://kibana.[...].com/internal/apm/settings/anomaly-detection/jobs 403 (Forbidden)
Unchecked runtime.lastError: A listener indicated an asynchronous response by returning true, but the message channel closed before a response was received

Not sure if those are related

Provide logs and/or server output (if relevant):

I added the ui tag so we can hopefully get someone from the UI team to weigh in. I'm surprised that if the span documents are there and contain the same trace ID, we're not seeing them in the waterfall.

Up, can someone take a look at this?

I did a little more investigation and asking around here, and it seems that orphaned spans are not currently displayed in the waterfall. Orphaned spans are those missing a parent span (via parent ID).

Can you search out whether you can see the parent span for those spans in Kibana? Alternatively, you could export all of the documents with a given (problematic) trace ID so we could take a look. Be careful of sensitive information!

trace.id: 09974d7b3d43d880d0bd6ffab80bee1a and not parent.id: * gives only single document, all spans have parent.id. I'm not sure how to follow the parent.id tho, as I got a span from celery, it has parent.id == transaction.id, got that. But the transaction of this celery task has parent.id == service A HTTP call span.id.

I'd expect it to work like that:
1 Span from celery task transaction has parent.id == celery task transaction.id
2 Transaction from celery task has parent.id == span from service B (but which one? redis call? transaction B transaction.id?)
3 Spans from service B has parent.id == service B transaction.id
4 Transaction from service B has parent.id == service A HTTP call span
5 Spans from service A has parent.id == service A transaction.id
6 Transaction from service A has no parent.id

Instead point 2 looks like "Transaction from celery task has parent.id == service A HTTP call span", skipping the service B. This may not have anything to do with this case, as I see the same thing in other traces with the A -> B -> Celery flow which display correctly

I'm attaching the whole trace.id, beware it's about 60MB (I really need to prune some unhelpful labels) - https [colon slash slash] file [dot] io/bTEruaQ7p7fw (cant either upload a file here nor post a link...)

The fact that this is not happening on all of your transactions is certainly perplexing. Can you spot any difference between the spans that make up the working waterfall and the broken one?

Exactly this is what bothers me the most. I am able to reproduce the issue (just resend the request to service A) but I dont see anything unusual in the stored data that may explain this behavior

So the same request will always result in the issue? Now I'm wondering if we can spot the difference between a "good" request and a "bad" request -- especially around any resulting difference in the way the celery tasks would be scheduled.

I've managed to get the trace for this request render semi-correctly. It shows the waterfall of Service A -> Service B -> Celery tasks, but together with it the message Number of items in this trace exceed what is displayed. Despite this message it looks fine and usable

I've reduced the amount of data to be processed, so fewer tasks were triggered. There are 1057 documents with this trace.id.

1 Like

Ah, yes, I should have considered that we may be running up against the max size of the waterfall. Glad you have it working!