Hi! I'm facing a strange case related to distributed tracing:
There are two apps deployed on a k8s cluster, both using elasticAPM java agent. The first app executes N parallel calls to the another via HTTP, using Fork/Join (ForkJoinPool) Java classes.
The second application also manages the service request in an asynchronous way executing its implementation method (service#callService) with Fork/Join too.
Sometimes, on the first request to the app after deployment, I can see that some spans of the second application (service#callService) are not recorded properly. They're tracked with a different traceId and get dettached outside from the original request scope.
Currently, that's the only situation where this issue arises. I've made some tests with more than one thread using Jmeter, and it worked OK the rest of the time, showing in APM monitor all the request spans.
This ¿bug? was also reproduced using the opentracing bridge.
Have you ever experienced an issue like this? maybe could it be related to Fork/Join components?
Making sure I understand correctly: the TicketGroup2Controller#singleSelectTicket transaction is traced properly as child of the POST span from the root transaction, however, the SingleSelectTicketService#callService span is sometimes traced with improper parent and wrong trace ID? Is it still captured and sent to the APM server, but with wrong trace ID? If you search in Kibana for this trace ID, do you find additional events?
Also, what is producing the SingleSelectTicketService#callService span? Is it the agent's auto-instrumentation? If so- for which technology? If not (meaning - if you creating it manually), please add your code that creates this span.
So, this can be observed only within the first request, but not in any subsequent requests?
Yes, it's ok. #singleSelectTicket transaction is traced properly. But sometimes callService span (child of the previous one) is tracked with another trace ID and isn't shown in the chart (in the picture sent you can see that the 2nd and 3rd transactions don't contain that span).
I can see the "lost" spans in Kibana but as unrelated transactions (not associated with the original main request transaction).
I tested it producing these spans with two different strategies:
Using Elastic APM java agent via trace_methods property configuration.
Using an implementation of opentracing bridge.
The issue was observed in both alternatives
The apps are based on spring-boot 2.x (supported by elastic java agent)
Yes, just in the first transaction request after application deployment. So far that's the only case where I've seen this issue.
In order to see if the issue occurs in another situation, I've made some additional tests, but they worked OK.
I think this miss in the first request happens due to the fact that we are applying our inter-thread context propagation instrumentation dynamically when we first encounter a task/runnable/callable type. We do that in order to reduce the impact on startup time.
The reason you get the forked task as a separate transaction is because you trace it with trace_methods, which creates a transaction if there is no active one on the executing thread, which is the case in the first invocation due to the context-propagation miss explained above. It is the same with OT bridge.
Bottom line, this may be the cost of reducing startup time overhead of the agent. Would you say this is problematic enough for you to trade it with startup time increase? (Not suggesting we will do that, but interested in your opinion )
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.