Help understanding a gap in my trace

bcollins · November 18, 2020, 6:56pm

I have attached a picture of a trace between two microservices. I am wonder why there is such a big gap between when the request is fired and the other service processes it. The second service is processing a result extremely quickly but this gap is adding almost 1000% to my service call time. This is a nodejs project using axios to make service calls between 2 apps on the same kubernetes cluster.

alanstorm · November 18, 2020, 11:54pm

Nice to meet you @bcollins -- it's always tricky to debug someone else's application remotely, but I'll let you know what I see.

I'm going to assume every service is running on Node.js from top to bottom -- if that's not the case that may complicate the theories. Let us know.

This trace looks like it represents either a custom transaction or a transaction where you've set a custom name of brokerage_account_update_totals. It's not clear what this transaction is -- it could be a web application responding to a web request, or a it could be a cli job, a queue job, etc. Without context it's hard to say.

Regardless -- the code in this transaction makes an HTTP request to

`http(s?)://some-service.example.com/api/v1/synthetic_trade_orders`

The service at /api/v1/synthetic_trade_orders reports that it begins processing the request 600ms later.

The first two things that pop to mind are this

First -- the Elastic Node.js agent should report its transaction time when the server actually begins processing the request. If the service at /api/v1/synthetic_trade_orders has a saturated event loop/microtask queue it may be that this trace represent a legitimate problem where /api/v1/synthetic_trade_orders couldn't begin processing the transaction until that time. This would result is a slow to respond transaction. This may only be an occasional occurrence which is why you're seeing fast response times -- how often does this long trace show up?

Second -- you can sometimes see time gaps like this when the server time isn't synced. If you have the ability to check the time on the kube pods it would be good to rule out an NTP syncing issue.

Also -- which axios method are you using to make your request and which HTTP server are you using in these services? (There may be weird async issues in how these libraries are implemented so it'd be good to rule that out)

bcollins · November 19, 2020, 4:15pm

Hi @alanstorm, thank you for your response.

More clarification:
brokerage_account_update_totals is a handler for a rabbitmq message, which is why I wrapped in custom instrumention.

Overall the services are under low load and calling /api/v1/synthetic_trade_orders directly results in a fast response, so I doubt it is event loop saturation. I am seeing these gaps quite frequently everywhere in both directions across multiple microservices.
I have confirmed the times between my pods are indeed correct.
I am using a simple axios.get(${config.services.api.host}/api/v1/synthetic_trade_orders) and I am using Hapi and Express (some microservices express, others have been refactored to hapi)

Here is another example. I have a very lightweight hapi service which is simply a socket.io server (with hapi) which is doing very little at the moment, with only 2 or 3 connection. I fire a POST call at it to send socket messages to users, while it sends out the actual socket message extremely quickly, there is still a gap (on both the request and response side):

bcollins · November 19, 2020, 6:19pm

and another example:

alanstorm · November 20, 2020, 6:14pm

Thanks for that additional context @bcollins -- nothing seems out of the ordinary in what you've described so far. Here's some other questions/thoughts that come to mind that might help us track this down. (Apologies if all/some of this is obvious, I err on the side of over communicating sometimes )

Overall the services are under low load and calling /api/v1/synthetic_trade_orders directly results in a fast response, so I doubt it is event loop saturation. I am seeing these gaps quite frequently everywhere in both directions across multiple microservices.

This definitely doesn't sound like event loop saturation from too much traffic -- but there may be some synchronous work that the service at ${config.services.api.host} is doing that blocks the node process from processing the next incoming request. Is api/v1/synthetic_trade_orders the only endpoint for this service? What sort of work does this service do? Is there a chance there's some long running synchronous work that might block the event-loop?

Re: event loop delay/blocking the agent has a metric that can detect this. If you chart nodejs.eventloop.delay.avg.ms over time via a Kibana Visualization you might be able to correlate event loop delays with these gaps you're seeing (or confirm that there's no event loop delay when these gaps occur)

Additionally -- what does your Transaction duration distribution graph look like? Are these long running traces outliers? Or are they showing up in every bucket during normal operations?

Finally -- you mentioned running some manual tests where everything "seemed fast". Have you been able to correlate these specific requests to specific traces? Often the sort of delays that these charts represent are outliers that only happen during specific blocking events with a service.

bcollins · November 20, 2020, 7:49pm

What actually led me to this path of instrumentation was indeed some blocking call, I found that one of the apps, when hit with a type of internal authentication was running bcrypt for every request. After I eliminated that I got much better response time and solved a serious gap. But this gap remains everywhere. The user-sockets service picture I posted above, that service is absolutely doing nothing that could block it's event loop, yet it also shows big gaps before and after the request.

nodejs.eventloop.delay.avg.ms is a very interesting thing to learn, I will investigate this next.

Also no they do not appear to be outliers, I am seeing them all the time. Sometimes they are quite a bit shorter but more often then not they aren't, and looking at the cpu metrics on kube the services do not appear to be using much cpu.

I will continue investigating with the info you've provided, but I am wondering, is there any way to get the instrumentor to instrument every single internal nodejs function call automatically?

EDIT: Am I doing this right? I'm not seeing any results...

EDIT: I got this query to work, however it only shows 1 of our apps with a semi high result... and this app should have no effect on the other instrumented services as it is completely independent.

bcollins · November 20, 2020, 8:59pm

Transactions graphs:

alanstorm · November 23, 2020, 11:50pm

@bcollins sorry to throw you in the graphing UI without any guidance.

It looks like from your screenshot you're graphing things by count, not time. Just in case you're still having trouble there -- your best two options for graphing this metric are

We're working on fixing our importable dashboards -- if you wanted to give the JSON file in that PR a try (Stack Management -> Kibana -> Saved Objects, click Import) it will give you a new dashboard that graphs a number of different Node.js metrics, including the event loop delay
Create a Lens Chart. Lens charts tend to be the easier to get started with -- just drag fields into the y axis and go.

is there any way to get the instrumentor to instrument every single internal nodejs function call automatically?

Unfortunately, No, there is not a way to do this. The performance cost of doing do would be prohibitive. The best we can do here for you is our startSpan API: Agent API | APM Node.js Agent Reference [master] | Elastic

With this you can add your own timings to your application and determine where the delays are. Hope that helps, and good luck!

system · December 14, 2020, 7:50pm

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
APM UI Showing gap between distributed transactions even when that much gap doesn't exist APM dotnet	13	746	October 20, 2020
Node Elastic APM agent. Transactions missing trace samples, timeline and metadata APM nodejs	1	460	October 16, 2023
Difference in data captured by APM agent (6.2.3) APM	4	571	July 12, 2018
APM - distirubted tracing in microservice-architectures APM	4	413	December 11, 2020
What causes gaps in span APM java	2	1026	February 17, 2020

Help understanding a gap in my trace

Related topics