**APM Server version:**
Elastic Cloud, deployment version 7.9.2.
**APM Agent language and version:**
Python agent 5.9.0, on Python 3.7.
**Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):**
We have a Python Flask app (running on uwsgi behind nginx) that is APM-enabled, i.e. we've installed the `elastic-apm[flask]` module and enable it with the following lines:
```python
from elasticapm.contrib.flask import ElasticAPM
from our_flask_setup import app

apm = ElasticAPM(app)
```
The configuration for where we ship the APM data is done by setting the environment variables `ELASTIC_APM_SERVICE_NAME`, `ELASTIC_APM_SERVER_URL` and `ELASTIC_APM_SECRET_TOKEN`.
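For what it's worth, I believe the same settings could also be provided in code via `app.config` instead of environment variables; a minimal sketch of that (with placeholder values, since our real configuration lives in the environment) would look roughly like this:

```python
from elasticapm.contrib.flask import ElasticAPM
from our_flask_setup import app

# Roughly equivalent in-code configuration; the values below are placeholders.
app.config["ELASTIC_APM"] = {
    "SERVICE_NAME": "our-service",
    "SERVER_URL": "https://our-apm-server.example.com:443",
    "SECRET_TOKEN": "...",
}
apm = ElasticAPM(app)
```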
During a certain time frame last week, our uwsgi workers stopped responding to requests and the uwsgi listen queue started building up. After some time, uwsgi started dropping requests because the queue was full. This coincided with the following lines in our Kibana logs (which are shipped through a separate mechanism, not the APM library):
```
elasticapm.transport.exceptions.TransportException: HTTP 503: {"accepted":0,"errors":[{"message":"queue is full"}]}
```
As I understand it, the "queue is full" message is produced server-side, by the APM Server in our Elastic Cloud deployment, when we try to ship APM data.
During the same time period, the Python agent also logs:

```
elasticapm.transport - dropping flushed data due to transport failure back-off
```
I.e. the Python process's in-memory queue is full, so the agent starts dropping data. That in itself is fine; it just strengthens my suspicion that something went wrong on the server side.
Without any detailed knowledge of the Python APM agent's shipping internals, my interpretation of our logs is that the agent failed to ship metrics. At the same time, for some reason, the app server's queues started growing because we were serving responses more slowly than requests were coming in. (The request rate was quite stable at the time.)
My suspicion is that the problems with the metrics-shipping process made our request serving slower and slower, which led to this queue growth and the subsequent resource starvation.
- Is this even remotely probable? Or am I barking up the wrong tree?
- Have you seen this issue before?
- If yes to the above questions: what can I do to make sure this doesn't happen again? (I'm thinking I could implement a circuit breaker that disables APM metrics collection entirely if we get the "queue is full" message, but it's quite an endeavour to build if I'm barking up the wrong tree. Besides, the existing in-memory queue and the dropping of messages when that queue is full should arguably already remedy this issue.) A rough sketch of what I have in mind follows below.
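To make the circuit-breaker idea more concrete, something along these lines is what I have in mind. It's only an untested sketch: it assumes that watching the `elasticapm.transport` logger for the message text quoted above is a reasonable signal, and that calling `apm.client.close()` from a request hook is safe; none of that is verified.

```python
import logging
import threading

from elasticapm.contrib.flask import ElasticAPM
from our_flask_setup import app

apm = ElasticAPM(app)

# Flags shared between the logging handler (which runs in the agent's
# transport thread) and the request hook (which runs in the uwsgi worker).
_transport_failed = threading.Event()
_apm_closed = threading.Event()


class TransportFailureHandler(logging.Handler):
    """Raise a flag when the agent reports transport failures / back-off."""

    def emit(self, record):
        message = record.getMessage()
        if "transport failure" in message or "queue is full" in message:
            _transport_failed.set()


# The "dropping flushed data due to transport failure back-off" line in our
# logs comes from the "elasticapm.transport" logger, so listen to that one.
logging.getLogger("elasticapm.transport").addHandler(TransportFailureHandler())


@app.before_request
def apm_circuit_breaker():
    # Once a transport failure has been seen, shut the APM client down from
    # a request context rather than from inside the transport thread itself.
    if _transport_failed.is_set() and not _apm_closed.is_set():
        _apm_closed.set()
        apm.client.close()
```

The idea is simply to stop producing APM data locally instead of retrying against a server-side queue that is already full.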