Missing spans

Hi,

I'm trying to troubleshoot spans that seem to be disappearing in a distributed trace environment.

In my current case, associations between transactions across multiple services are correctly tracked.

However, some spans I expect to show up are not showing up. One service is affected. I have a couple of ideas of where complications might arise.

This service is using knex which seems to be listed as an "offending module"

https://www.elastic.co/guide/en/apm/agent/nodejs/master/supported-technologies.html

It's also using babel-register which I am sure to require before I require elastic-apm-node and start() it. This service's entry point checks NODE_ENV and then either evaluates a module that was built by babel statically and stored on disk, or in the case of a non-production environment does require('babel-register') to get on-demand transpilation. This latter situation is the one I'm troubleshooting now.

Here are my exact versions:

elastic-apm-node@2.8.0
babel-register@6.26.0
babel-core@6.26.3
knex@0.13.0

Are there any docs that explain the mechanisms used by apm-agent-nodejs track the "active" transaction so that when I call the startSpan() method of the Agent instance it knows which transaction it belongs to? I think that might be useful knowledge in troubleshooting this.

Any advice for troubleshooting would be welcome.

Thanks!

Is there any documentation for the documents stored in ES by APM Server? That could be really helpful for me just to understand what I'm seeing better.

knex shouldn't interfere with the collection of spans and their association with each other. So I don't think that's an issue. I do however notice that you're using Babel, which has given some problems in the past where spans would not be collected as expected.

I recommend that if you use Babel, that you do not manually require and then start the agent inside your source code as that can lead to some missing spans in case the agent doesn't add its hooks in time. Instead, you should use the approach described here:

This is, unfortunately, a pretty complicated topic. We have not documented a "behind the scenes" look at how this works directly, but I've given quite a few talks about the general concept that might be of interest to you. If you're interested, you can also take a look at the code that handles async context propagation for the agent.

You can find information about the data model here:

1 Like

@wa7son thanks for the great info :slight_smile:

Instead, you should use the approach described here

I looked at the start module and I thought I didn't need to heed the advice :wink:

That module only seems to invoke start() so I'm unsure how exactly it's meant to solve this problem. Any insight you can provide would be very helpful.

Will this solution work for me if I need to pass some arguments to start() or if I need to use addPatch?

What I hoped I could do instead is have an ordinary (non es-modules, non-babel) entry point which require()s babel-register, starts APM, and then loads the rest of the application.

I'll take a look at the code you linked soon! Thanks again!

The issue that the start module tries to circumvent is what happens when Babel converts this:

require('elastic-apm-node').start({...})

import express from 'express'

To this (roughly):

const express = require('express')

require('elastic-apm-node').start({...})

As you can see, Babel will convert all import statements to require() calls and hoist them above the call to agent.start(). And since this call is required to be before any code that it needs to instrument, this will mean that the agent misses certain events in the code.

If you instead wrote the code like this:

import agent from 'elastic-apm-node/start'
import express from 'express'

It will be converted to this (roughly):

const agent = require('elastic-apm-node/start')
const express = require('express')

Which still works as expected.

No, unfortunately, the tradeoff of using elastic-apm-node/start is that you have to configure it using either environment variables, or the Elastic APM config file as described in our ES Modules docs.

For now, you need to call addPatch() after starting the agent, but as long as this happens before the call to require is made that the patch is supposed to patch, that will still work. If you need support for adding patches during start, we can add a config option that will allow you to register patches via environment variables instead. But in many cases, you should be able to just add the patch via a call to addPatch() before you require the module that should be patched.

I think I played around with this a while back to see if that was a viable workaround. but I could never get it to work properly if I recall correctly. That's why we ended up recommending the other approach. But if you do find a way to make it work, please let me know :smiley:

I'll do some more research and debugging and see what I find out!

Thanks again!

An alternative would be to put all your agent init logic into a separate file, which basically would serve as your own start file. This could then be imported at the very top like normal.

Question: Are you running your code by first compiling it to disk and then running the compile code, or are you running it via a babel-runner that compiles and runs it on the fly?

If you're doing the latter, babel might inject some things before the agent, which will make things break. If so I recommend using the -r elastic-apm-node/start command line flag for node (e.g. node -r elastic-apm-node/start app.js).

In production I compile AOT, and load the transpiled code's entrypoint in my non-transpiled top level entry point.

Outside of production (where I'm having the problem atm,) I am using babel-register to transpile on the fly.

Taking a look at the babel-register package on npm, it looks like I should probably consider changing to another package (@babel/register maybe.)

Could you share your babel-register setup, so I can try it out? I'd like to update the docs to show how to best use the agent together with babel-register if I can figure it out

This is basically what I'm doing now, but as I've mentioned, it's giving me some trouble.

I'm using a non-transpiled bootstrap script that invokes either the AOT transpiled entrypoint, or loads babel-register and then load my entrypoint which is dynamically transpiled:

if (process.env.NODE_ENV === 'production') {
  require('elastic-apm-node').start(...)
  require('./dist/index.js') // transpiled AOT
} else {
  require('babel-register')
  require('elastic-apm-node').start(...)
  require('./src/index.js') // transpiled JIT
}

FYI I have the same issues in production mode

One pattern I'm seeing in spans that appear and spans that don't appear is that ones that don't appear happen to be using knex.raw and ones that do appear are not using knex.raw. Edit: Adding two simple tests to this service, one using raw and one not, does not reproduce this same pattern.

Actually, it's interesting that I don't see two spans for each database request made via knex as our underlying knex driver is mysql2.

How do the APM instruments for knex and mysql2 avoid tracing the same requests?

I've enabled logger in elastic-apm-node and I'm seeing this in a few places:

no active transaction found - cannot build new span

So, I infer that the span is being collected, but it can't be associated with the transaction it ought to belong to. But the transaction is always created (I've not noticed any missing transactions.)

I would think that the typical failure mode from Babel interference would be that some or all uses of a module that should be instrumented may not be instrumented, and so transactions that should be created may not be, or spans that should be created may not be.

Is this a correct assumption? Or has it been observed that sometimes Babel causes failures to track transactions so that spans can be associated to them?

Thanks!

I just tried to run some of the tests with -r babel-register, but couldn't reproduce the problem. I think maybe the problem lies elsewhere.

This, unfortunately, sounds more like a context propagation issue due to a user-land callback queue. I.e. some of your dependencies (direct or indirect) maintains a queue of callbacks outside the control of Node core. When any of these callbacks are called, the agent will not know which transaction was the active transaction when they were originally queued.

We explicitly patch modules known to cause this issue, but once in a while a new module pops up that causes this problem.

To rule out an issue with babel-register could I get you to test if the problem goes away if you start the agent using a local config file instead and with node -r elastic-apm-node/start app.js. I think that will rule out any issues caused by babel-register.

Yes I will give this a try this weekend

I believe I understand what you're saying.

Assuming your suggested test demonstrates that the problem is not Babel, do you have any advice for how to go about detecting the site of the complication?

Thanks again!

If the issue is a context propagation issue caused by a dependency, we need to find which dependency it is so we can make a patch for it.

The first step here is sharing your dependencies from your package.json so we can see if we can spot which dependency might be causing this and then read through its source code.

Sorry, I haven't had time to revisit this issue this week.

I'm just back from vacation my self. Did you get time to look into this?

Not yet sadly, however I did find something interesting.

In many parts of the service, a promise is created to fulfill an HTTP request, but might never be resolved or rejected.

Could this cause a problem?