Help understanding a gap in my trace

I have attached a picture of a trace between two microservices. I am wonder why there is such a big gap between when the request is fired and the other service processes it. The second service is processing a result extremely quickly but this gap is adding almost 1000% to my service call time. This is a nodejs project using axios to make service calls between 2 apps on the same kubernetes cluster.

Nice to meet you @bcollins -- it's always tricky to debug someone else's application remotely, but I'll let you know what I see.

I'm going to assume every service is running on Node.js from top to bottom -- if that's not the case that may complicate the theories. Let us know.

This trace looks like it represents either a custom transaction or a transaction where you've set a custom name of brokerage_account_update_totals. It's not clear what this transaction is -- it could be a web application responding to a web request, or a it could be a cli job, a queue job, etc. Without context it's hard to say.

Regardless -- the code in this transaction makes an HTTP request to

`http(s?)://some-service.example.com/api/v1/synthetic_trade_orders` 

The service at /api/v1/synthetic_trade_orders reports that it begins processing the request 600ms later.

The first two things that pop to mind are this

First -- the Elastic Node.js agent should report its transaction time when the server actually begins processing the request. If the service at /api/v1/synthetic_trade_orders has a saturated event loop/microtask queue it may be that this trace represent a legitimate problem where /api/v1/synthetic_trade_orders couldn't begin processing the transaction until that time. This would result is a slow to respond transaction. This may only be an occasional occurrence which is why you're seeing fast response times -- how often does this long trace show up?

Second -- you can sometimes see time gaps like this when the server time isn't synced. If you have the ability to check the time on the kube pods it would be good to rule out an NTP syncing issue.

Also -- which axios method are you using to make your request and which HTTP server are you using in these services? (There may be weird async issues in how these libraries are implemented so it'd be good to rule that out)

Hi @alanstorm, thank you for your response.

More clarification:
brokerage_account_update_totals is a handler for a rabbitmq message, which is why I wrapped in custom instrumention.

  1. Overall the services are under low load and calling /api/v1/synthetic_trade_orders directly results in a fast response, so I doubt it is event loop saturation. I am seeing these gaps quite frequently everywhere in both directions across multiple microservices.
  2. I have confirmed the times between my pods are indeed correct.
  3. I am using a simple axios.get(${config.services.api.host}/api/v1/synthetic_trade_orders) and I am using Hapi and Express (some microservices express, others have been refactored to hapi)

Here is another example. I have a very lightweight hapi service which is simply a socket.io server (with hapi) which is doing very little at the moment, with only 2 or 3 connection. I fire a POST call at it to send socket messages to users, while it sends out the actual socket message extremely quickly, there is still a gap (on both the request and response side):

and another example:

Thanks for that additional context @bcollins -- nothing seems out of the ordinary in what you've described so far. Here's some other questions/thoughts that come to mind that might help us track this down. (Apologies if all/some of this is obvious, I err on the side of over communicating sometimes :slight_smile: )

Overall the services are under low load and calling /api/v1/synthetic_trade_orders directly results in a fast response, so I doubt it is event loop saturation. I am seeing these gaps quite frequently everywhere in both directions across multiple microservices.

This definitely doesn't sound like event loop saturation from too much traffic -- but there may be some synchronous work that the service at ${config.services.api.host} is doing that blocks the node process from processing the next incoming request. Is api/v1/synthetic_trade_orders the only endpoint for this service? What sort of work does this service do? Is there a chance there's some long running synchronous work that might block the event-loop?

Re: event loop delay/blocking the agent has a metric that can detect this. If you chart nodejs.eventloop.delay.avg.ms over time via a Kibana Visualization you might be able to correlate event loop delays with these gaps you're seeing (or confirm that there's no event loop delay when these gaps occur)

Additionally -- what does your Transaction duration distribution graph look like? Are these long running traces outliers? Or are they showing up in every bucket during normal operations?

Finally -- you mentioned running some manual tests where everything "seemed fast". Have you been able to correlate these specific requests to specific traces? Often the sort of delays that these charts represent are outliers that only happen during specific blocking events with a service.

What actually led me to this path of instrumentation was indeed some blocking call, I found that one of the apps, when hit with a type of internal authentication was running bcrypt for every request. After I eliminated that I got much better response time and solved a serious gap. But this gap remains everywhere. The user-sockets service picture I posted above, that service is absolutely doing nothing that could block it's event loop, yet it also shows big gaps before and after the request.

nodejs.eventloop.delay.avg.ms is a very interesting thing to learn, I will investigate this next.

Also no they do not appear to be outliers, I am seeing them all the time. Sometimes they are quite a bit shorter but more often then not they aren't, and looking at the cpu metrics on kube the services do not appear to be using much cpu.

I will continue investigating with the info you've provided, but I am wondering, is there any way to get the instrumentor to instrument every single internal nodejs function call automatically?

EDIT: Am I doing this right? I'm not seeing any results...

EDIT: I got this query to work, however it only shows 1 of our apps with a semi high result... and this app should have no effect on the other instrumented services as it is completely independent.

Transactions graphs:

@bcollins sorry to throw you in the graphing UI without any guidance.

It looks like from your screenshot you're graphing things by count, not time. Just in case you're still having trouble there -- your best two options for graphing this metric are

  1. We're working on fixing our importable dashboards -- if you wanted to give the JSON file in that PR a try (Stack Management -> Kibana -> Saved Objects, click Import) it will give you a new dashboard that graphs a number of different Node.js metrics, including the event loop delay

  2. Create a Lens Chart. Lens charts tend to be the easier to get started with -- just drag fields into the y axis and go.

is there any way to get the instrumentor to instrument every single internal nodejs function call automatically?

Unfortunately, No, there is not a way to do this. The performance cost of doing do would be prohibitive. The best we can do here for you is our startSpan API: https://www.elastic.co/guide/en/apm/agent/nodejs/master/agent-api.html#apm-start-span

With this you can add your own timings to your application and determine where the delays are. Hope that helps, and good luck!