We have a simple Node.js app with APM set up, deployed on a Kubernetes cluster, with events coming through successfully. Suddenly we are unable to bring the app up because we have started to receive this error: 'metadata requests did not finish, possible deadlock'
Hey there @Sim_Singh, thanks for bringing this up.
The code you linked is related to a new feature in the agent where we attempt to collect metadata about the local cloud environment (Azure, GCP, AWS, etc.) via those services' cloud metadata endpoints (e.g. the AWS instance metadata endpoint) and then report this data to your Elasticsearch instance along with other APM metrics and data.
The CallbackCoordination class where that error originates is responsible for coordinating the three network requests to each service's endpoint. The specific code path you've hit is a fallback where, if those endpoints fail to respond within a certain amount of time, we give up trying to fetch that data.
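To illustrate the general pattern (this is a simplified sketch, not the agent's actual CallbackCoordination code): each provider's metadata endpoint is queried concurrently, and a timeout acts as the fallback so that a hung endpoint can't block startup indefinitely.

```javascript
// Sketch of timeout-bounded metadata collection. Each fetcher queries one
// provider's metadata endpoint (AWS, GCP, Azure); if a fetcher doesn't
// respond in time, we fall back to null and carry on without its data.
function withTimeout(promise, ms, fallback) {
  return Promise.race([
    promise,
    new Promise((resolve) => setTimeout(() => resolve(fallback), ms)),
  ]);
}

async function collectCloudMetadata(fetchers, timeoutMs) {
  const results = await Promise.all(
    fetchers.map((fetch) => withTimeout(fetch(), timeoutMs, null))
  );
  // Slow or unreachable endpoints yield null; keep only real responses.
  return results.filter((r) => r !== null);
}
```

In the agent, hitting this timeout path is what produces the "metadata requests did not finish, possible deadlock" log line; the intended behavior is to log it and continue.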
The error you're seeing logged ("metadata requests did not finish, possible deadlock") should not be a fatal error. It should be logged and then the agent startup should proceed. If it's preventing your application from starting that's exactly the sort of bug we want to know about and fix.
First -- is the above the sort of context you were looking for and does it help? Or did we miss the mark?
Second, here are two things that may help you work around this:

1. Change the `cloudProvider` config value to match the cloud provider you're on, or set it to `none`. This may alleviate the problem.
2. Temporarily fall back to an older version of the agent without this metadata-fetching functionality.
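For the first workaround, the `cloudProvider` option can be set where the agent is started. This is a minimal sketch assuming the standard `elastic-apm-node` setup; the `serviceName` and `serverUrl` values here are placeholders for your own configuration.

```javascript
// Start the agent with cloud metadata fetching pinned to one provider
// (or disabled entirely with 'none').
const apm = require('elastic-apm-node').start({
  serviceName: 'my-service',          // placeholder: your service name
  serverUrl: 'http://localhost:8200', // placeholder: your APM Server URL
  cloudProvider: 'aws',               // or 'none' to skip metadata fetching
});
```

The same setting can also be supplied via the `ELASTIC_APM_CLOUD_PROVIDER` environment variable if you prefer not to change code.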
Finally, could you let us know a little bit about your K8s environment? Are you running this on a service provided by one of the cloud vendors we mentioned, or is it a home-grown K8s cluster that you're running yourselves? (Or is it home-grown K8s but running on a cloud provider's infrastructure?) We didn't see the behavior you're describing when we were testing this feature, so if it IS a fatal error for you, we may have missed something that's unique to your environment. The more we know, the better we can help you diagnose the issue.
It is our home-grown K8s running on AWS infrastructure. Our APM Server version is 7.9.3, the Elasticsearch version is 7.11.0, and the Kubernetes version we are running is 1.9.
Ignore the following; our DevOps team was messing with the Elastic server.

Reverting to the previous agent version, 3.10.0, gives us the following errors:
Feb 26, 2021 @ 12:05:16.297 APM Server transport error (503): Unexpected APM Server response
Feb 26, 2021 @ 12:05:16.297 APM Server accepted 0 events in the last request
Feb 26, 2021 @ 12:05:16.297 Error: queue is full
And suddenly it started to work, about 40 minutes after deployment. Not sure why!