APM-agent-nodejs prevents app start with error - 'metadata requests did not finish, possible deadlock'

We have a simple Node.js app with APM set up, deployed on a k8s cluster, with events coming through successfully. Suddenly we are unable to bring the app up because we have started receiving this error: 'metadata requests did not finish, possible deadlock'

After spending some time on this without any luck, I searched through GitHub and was able to pinpoint the error to this recently updated code: apm-agent-nodejs/callback-coordination.js at d48e1285d5a21abaf8eab3ffa061934dde16dbb8 · elastic/apm-agent-nodejs · GitHub

But it still wasn't clear what is causing it or what we can do to fix it.

Can somebody with context around this recently added code help?

Thanks,
Sim

Hey there @Sim_Singh, thanks for bringing this up.

The code you linked is related to a new feature in the Agent where we attempt to collect metadata about the local cloud environment (Azure, GCP, AWS, etc.) via those services' cloud metadata endpoints (e.g. AWS Metadata) and then report this data to your Elasticsearch instance along with other APM metrics and data.

The CallbackCoordination class where that error originates is responsible for coordinating the three network requests to each service's endpoint. The specific codepath you've hit is a fallback where, if those services' endpoints fail to respond within a certain amount of time, we give up trying to fetch that data.

The error you're seeing logged ("metadata requests did not finish, possible deadlock") should not be a fatal error. It should be logged, and then the agent startup should proceed. If it's preventing your application from starting, that's exactly the sort of bug we want to know about and fix.
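For readers who want a picture of the pattern described above, here is a minimal sketch of coordinating several callback-style requests with a timeout fallback. It is not the agent's actual CallbackCoordination code: the function name `firstResultOrTimeout`, its signature, and the 50 ms timeout are all hypothetical, and only the error message is taken from this thread.

```javascript
// Hypothetical helper, NOT the agent's CallbackCoordination class.
// Runs several callback-style fetchers, reports the first successful
// result, and gives up with an error if none finish within timeoutMs.
function firstResultOrTimeout(fetchers, timeoutMs, done) {
  let finished = false;
  let pending = fetchers.length;

  // Ensures `done` is called at most once and the timer is cleaned up.
  const finish = (err, result) => {
    if (finished) return;
    finished = true;
    clearTimeout(timer);
    done(err, result);
  };

  // Fallback: stop waiting if the fetchers never call back.
  const timer = setTimeout(() => {
    finish(new Error('metadata requests did not finish, possible deadlock'));
  }, timeoutMs);

  for (const fetcher of fetchers) {
    fetcher((err, result) => {
      pending -= 1;
      if (!err && result) return finish(null, result);
      if (pending === 0) finish(new Error('all metadata requests failed'));
    });
  }
}

// Usage: a fetcher that never calls back triggers the fallback.
// The error is only logged, so startup can continue.
const neverResponds = (cb) => {};
firstResultOrTimeout([neverResponds], 50, (err, metadata) => {
  if (err) console.warn(`cloud metadata skipped: ${err.message}`);
  console.log('continuing startup');
});
```

The point of the sketch is the last few lines: the timeout produces an error for the caller to log, and nothing in that path needs to stop the application from starting.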

First -- is the above the sort of context you were looking for and does it help? Or did we miss the mark?

Second, two things that may help you work around this:

  1. Change the cloudProvider value to match the cloud provider whose service you're on, or change it to none. This may help alleviate the problem.

  2. Temporarily fall back to using an older version of the agent without this metadata fetching functionality.
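As a rough illustration of the first workaround, the sketch below builds agent options with cloudProvider set explicitly. The helper `withCloudProvider`, the `my-service` name, and the list of accepted values are assumptions made for this example; check the list against the documentation for your agent version.

```javascript
// Values assumed to be accepted by the cloudProvider option;
// verify against the docs for the agent version you run.
const CLOUD_PROVIDERS = ['auto', 'aws', 'gcp', 'azure', 'none'];

// Hypothetical helper: returns a copy of the agent options with
// cloudProvider set, rejecting values outside the list above.
function withCloudProvider(options, provider) {
  if (!CLOUD_PROVIDERS.includes(provider)) {
    throw new Error(`unsupported cloudProvider: ${provider}`);
  }
  return { ...options, cloudProvider: provider };
}

// 'my-service' is a placeholder service name.
const options = withCloudProvider({ serviceName: 'my-service' }, 'none');

// In the real app this would be the first statement executed:
// require('elastic-apm-node').start(options);
```

On Kubernetes it may be simpler to leave the code alone and set the ELASTIC_APM_CLOUD_PROVIDER environment variable on the container instead, since the agent also reads this option from the environment.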

Finally, could you let us know a little bit about your K8s environment? Are you running this on a service provided by one of the cloud vendors we mentioned, or is it a home grown K8s that you're running (or a home grown K8s running on a cloud provider's infrastructure)? We didn't see the behavior you're describing when we were testing this feature, so if it IS a fatal error for you, we may have missed something that's unique to your environment. The more we know about it, the better we can help you diagnose the issue.

Hey Alan,

Thanks for the quick response!

It is our home grown K8s running on AWS infrastructure. Our APM server version is 7.9.3 and the Elasticsearch version is 7.11.0. The Kubernetes version we are running is 1.9.

Ignore the following; our devops team was messing with the Elastic server.

Reverting to the previous version, 3.10.0, started giving us the following error:

    Feb 26, 2021 @ 12:05:16.297  APM Server transport error (503): Unexpected APM Server response
    Feb 26, 2021 @ 12:05:16.297  APM Server accepted 0 events in the last request
    Feb 26, 2021 @ 12:05:16.297  Error: queue is full

And then it suddenly started to work, seemingly at random, about 40 minutes after deployment. Not sure why!
