APM: 503 Queue is full, server sleeping, nothing helps

Hello,
We are trying to use APM to monitor our website, but after some time APM Server starts returning a 503 "Queue is full" error. Once this happens it does not recover; only restarting the APM service helps. The server is practically idle: CPU usage was around 15% and memory only 50% full. When I enabled it tonight I also ran performance tests, and there was no problem at 900 rpm, but then it crashed at around 50 rpm... None of the performance settings seem to make a difference. I don't think our traffic is so heavy that 4 CPUs and 12 GB of RAM (half used) can't handle it.
It throws this error no matter how big/moderate/conservative the values in configuration are...

Kibana version: 7.4

Elasticsearch version: 7.4

APM Server version: 7.4

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Official 7.x repository

Fresh install or upgraded from other version? Upgraded from 7.2 before using APM

Is there anything special in your setup? No additional outputs except Elasticsearch

I left monitoring turned on, because this was not the first crash and we wanted to see what was going on:

Configuration:
######### APM Server Configuration ##########

############# APM Server ################

apm-server:
queue:
  mem:
    events: 150000
    flush.min_events: 0
    flush.timeout: 5s
max_procs: 4

#===== Outputs =====

#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  hosts: ["ip:9200"]
  worker: 2
  bulk_max_size: 100000

Hi @lamka02sk,
we are investigating this. In the meantime, could you check your server logs for any errors or warnings?

Hello @simitt, I am sorry, but I could not find anything in the logs from the time of the crash. APM does not keep logs at all, and the Elasticsearch logs are almost empty except for some unrelated entries.

APM does not keep logs at all

Do you mean that you have disabled keeping logs? By default the APM Server does write to log files.

I assume you might be hitting a bug similar to one we have seen in another Discuss thread (APM Failed to publish events: temporary bulk send failure / Queue is full 503 error).
From 7.4 on, APM pipelines are enabled by default, and a new field, client.ip, is indexed. Providing invalid data for fields that are part of the pipelines can lead to errors and ingestion retries. This seems to happen in some cases in 7.4 for the client.ip field. There is a bug fix for this that will be part of the next patch release for 7.4.
Until then, I suggest you disable the pipeline and drop the client.ip field before it is ingested. You can do so by changing your apm-server.yml file to include the following settings:

output.elasticsearch.pipeline: "_none"
processors:
  - drop_fields:
      fields: ["client.ip"]
      ignore_missing: true

Hope this solves your issue, and apologies for the inconvenience.
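
To make the effect of the drop_fields part of the workaround more concrete, here is a rough Python sketch of what happens to an event. It is an illustration only, not the Beats implementation: the function name `drop_fields` and the sample event are hypothetical, and the real processor may behave differently in edge cases.

```python
from copy import deepcopy


def drop_fields(event, fields, ignore_missing=True):
    """Return a copy of `event` with the dotted-path `fields` removed.

    Hypothetical helper that only mimics the idea of the drop_fields
    processor; it is not taken from APM Server or Beats.
    """
    result = deepcopy(event)
    for path in fields:
        keys = path.split(".")
        parent = result
        # Walk down to the dict that should hold the last key.
        for key in keys[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
            if parent is None:
                break
        if isinstance(parent, dict) and keys[-1] in parent:
            del parent[keys[-1]]
        elif not ignore_missing:
            # With ignore_missing disabled, a missing field is an error.
            raise KeyError(f"field not found: {path}")
    return result


# Sample event with an invalid client.ip value (made up for illustration).
event = {"client": {"ip": "not-an-ip"}, "url": {"path": "/checkout"}}
cleaned = drop_fields(event, ["client.ip"])
print(cleaned)
```

With `ignore_missing` left at true, events that never had a client.ip field pass through unchanged instead of raising an error, which is why the workaround sets it.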

Hi,

I can confirm that it works now.

many thanks
tomislav

@simitt Any ETA for that patch? We have the same issue but unfortunately cannot apply that workaround because Elastic Cloud doesn't allow these settings to be set.


+1. Looks like we're in the same boat. Just been trying to tune the APM server for some bursts of "queue is full" and I think it's the same condition.

Is there a workaround for Elastic Cloud customers and/or a date for the fix?


I can confirm that upgrading to 7.4.1 fixes the issue. It now works.


@dnorth98 7.4.1, which includes the patch, was released today. You can enable the pipeline again and stop dropping the field. Thanks for confirming it works as expected, @rocketleap.

Thanks! Just updated the cluster and I'll keep an eye on the queue ingest errors for the next little bit.
