Elastic Cloud APM server - "queue is full"

APM Server version:

Elastic Cloud - deployment version 7.9.2.

APM Agent language and version:

APM python agent, version 5.9.0
python 3.7.

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

We have a python Flask app (running on uwsgi behind nginx) that's APM-enabled, i.e. we've installed the elastic-apm[flask] module, and are enabling it through the following lines:

from elasticapm.contrib.flask import ElasticAPM
from our_flask_setup import app

apm = ElasticAPM(app)

The configuration for where we ship the APM data is done by setting the env vars ELASTIC_APM_SERVICE_NAME, ELASTIC_APM_SERVER_URL and ELASTIC_APM_SECRET_TOKEN.
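For reference, those variables just need to be present in the process environment before ElasticAPM(app) is constructed; a minimal sketch (the values below are placeholders, not our real settings):

```python
import os

# Placeholder values -- the real ones come from our deployment's secrets.
os.environ["ELASTIC_APM_SERVICE_NAME"] = "our-flask-service"
os.environ["ELASTIC_APM_SERVER_URL"] = "https://apm.example.cloud.es.io:443"
os.environ["ELASTIC_APM_SECRET_TOKEN"] = "<redacted>"

# ElasticAPM(app) reads these automatically at construction time.
```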

During a certain time frame last week, our uwsgi workers stopped responding to requests, and the uwsgi queue started building up. After some time uwsgi started dropping requests because the queue was full. This coincided with the following lines in our Kibana logs (which are shipped through a separate mechanism, not the APM lib):

elasticapm.transport.exceptions.TransportException: HTTP 503: {"accepted":0,"errors":[{"message":"queue is full"}]}

As I understand it, the "queue is full" message is something that is produced server-side on the Elastic Cloud's backend, when we try to ship APM metrics.

During the same time period the python lib also says

elasticapm.transport - dropping flushed data due to transport failure back-off

I.e. the python process' in-memory queue is full, so we're starting to drop data. Which is of course fine; it just strengthens my suspicion that something has gone wrong on the server side.

Without having any detailed knowledge of the python APM metrics-shipping library, my interpretation from our logs is that the python lib failed to ship metrics. At the same time, for some reason, the python appserver's queues started growing because we were serving responses more slowly than requests were coming in. (The request rate was quite stable at the time.)

My suspicion is that the problems with the metrics-shipping process made our request serving slower and slower, which led to this queue growth and the subsequent resource starvation.

  1. Is this even remotely probable? Or am I barking up the wrong tree?
  2. Have you seen this issue before?
  3. If yes to the above questions: what can I do to make sure that this doesn't happen again? (I'm thinking I could implement a circuit breaker that disables APM metrics collection entirely if we get the "queue is full" message, but it's quite an endeavour if I'm barking up the wrong tree. Additionally, the current in-memory queue, and the dropping of messages when said queue is full, should already remedy this issue.)
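To illustrate what I mean by a circuit breaker, a minimal sketch (hypothetical names, not part of elastic-apm): count consecutive transport failures, trip after a threshold, and allow traffic again after a cooldown.

```python
import time


class ApmCircuitBreaker:
    """Trips after N consecutive failures; resets after a cooldown."""

    def __init__(self, max_failures=5, cooldown_seconds=300):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.tripped_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.tripped_at = time.monotonic()

    def record_success(self):
        self.failures = 0
        self.tripped_at = None

    def is_open(self):
        if self.tripped_at is None:
            return False
        if time.monotonic() - self.tripped_at > self.cooldown_seconds:
            # Cooldown elapsed -- close the breaker and try again.
            self.failures = 0
            self.tripped_at = None
            return False
        return True
```

The app would call record_failure() whenever it observes a TransportException with HTTP 503, and skip APM collection while is_open() returns True.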

Hi @thomaslundstrom,
when the APM Server returns a 503 "queue is full" message, it basically means that the server is receiving more events than it can currently ingest. This can have a couple of causes; depending on the cause, you might want to, for example, lower the sampling rate, scale up the APM Server instance, or review the server configuration settings. Please refer to the common-problems#queue-is-full section for more pointers.
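On the agent side, the sampling rate is controlled by the ELASTIC_APM_TRANSACTION_SAMPLE_RATE setting. Conceptually, head-based sampling is just a probabilistic keep/drop decision made once per transaction; a simplified sketch of the idea (not the agent's actual implementation):

```python
import random

SAMPLE_RATE = 0.1  # keep roughly 10% of transactions


def should_sample(rate=SAMPLE_RATE, rng=random.random):
    """Decide once, at the start of a transaction, whether to record it."""
    return rng() < rate
```

Unsampled transactions still count toward throughput metrics, but their detailed spans are never queued or shipped, which directly reduces the event volume hitting the server.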

Hi @simitt,

Thanks for your response. I've gathered as much: we're throwing too much at our poor APM server.

I have no problem with the actual response from the server when we overload it; once in a while there is more traffic than a system can handle. In the case of events (pun intended) or metrics, it's OK to drop requests, since more will come. :stuck_out_tongue:

What I'm after is this: I suspect that the python client library for APM might not be handling the situation especially well. As I tried to explain above, I'm seeing a correlation between "queue is full" log statements and our internal request pipelines filling up (and overloading). My suspicion is that when this happens, the python client library struggles, making all HTTP requests to our Flask app take longer than usual, which is causing our resource constraints.

Do you know of any other customers that have had similar problems?

Hi @thomaslundstrom! I work on the python agent.

All of the queue management and sending to the APM server happens in a background thread, and shouldn't block the request path in any way. Requests to the APM server may well be slower when the server is overloaded, but because waiting on those requests is just I/O in the background thread, I wouldn't expect it to slow your app at all.
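To make that design concrete, here's a simplified sketch of the pattern (hypothetical names; the real agent's internals differ): the request path only enqueues onto a bounded in-memory queue and never blocks, while a daemon thread drains the queue and performs the slow network I/O.

```python
import queue
import threading


class BackgroundSender:
    """Non-blocking event shipper: drop on a full queue instead of blocking."""

    def __init__(self, send, max_queue_size=1000):
        self._send = send  # the (possibly slow) network call
        self._queue = queue.Queue(maxsize=max_queue_size)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def enqueue(self, event):
        """Called from the request path; returns immediately."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # same spirit as "dropping flushed data"

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)  # slow I/O happens here, off the request path
            except Exception:
                pass  # a real implementation would back off, as the agent does
```

Even if the send call blocks for seconds, enqueue keeps returning instantly; once the queue fills up, further events are counted as dropped rather than stalling request handling.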

At the same time, due to some reason, the python appserver's queues started growing due to us serving responses more slowly than the number of requests coming in. (The number of requests was quite stable at the time.)

The problem with this situation is that the tool you would use to diagnose slow response times in your app is dropping that useful data due to an overloaded APM server. Are you seeing the slower responses via synthetics on the client side as well?

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.