APM: 503 Queue is full, server sleeping, nothing helps

Hello,
We are trying to use APM to monitor our website, but APM starts producing a 503 Queue is full error after some time. Once this happens it never recovers; only restarting the APM service helps. The server is practically idle: CPU usage was around 15% and memory only 50% used. When I enabled it tonight I also ran performance tests, and there was no problem at 900 rpm, but then it crashed at around 50 rpm... All the performance settings seem to be useless. I don't think our traffic is so heavy that 4 CPUs and 12 GB of RAM (half used) can't handle it.
It throws this error no matter how big/moderate/conservative the configuration values are...

Kibana version: 7.4

Elasticsearch version: 7.4

APM Server version: 7.4

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Official 7.x repository

Fresh install or upgraded from other version? Upgraded from 7.2 before using APM

Is there anything special in your setup? No additional outputs except Elasticsearch

I left monitoring turned on, so when it crashed the first time we could see what was going on:

Configuration:
######### APM Server Configuration ##########

############# APM Server ################

apm-server:
  queue:
    mem:
      events: 150000
      flush.min_events: 0
      flush.timeout: 5s

max_procs: 4

#===== Outputs =====

#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  hosts: ["ip:9200"]
  worker: 2
  bulk_max_size: 100000
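As an aside, the queue and bulk sizes above look mismatched: `bulk_max_size: 100000` is far above apm-server's defaults, and a single bulk request that large can stall the output long enough for the in-memory queue to fill. A more conservative sketch (values are illustrative, not a tested recommendation; tune against your own load):

```yaml
apm-server:
  queue:
    mem:
      events: 4096            # roughly worker * bulk_max_size, with headroom
      flush.min_events: 512   # flush smaller batches instead of waiting for the timeout
      flush.timeout: 1s

output.elasticsearch:
  hosts: ["ip:9200"]
  worker: 2
  bulk_max_size: 1024         # much smaller bulk requests than 100000
```

That said, if the queue never drains even at very low load, the cause is more likely ingestion errors than sizing.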

Hi @lamka02sk,
we are investigating this. In the meantime, could you check your server logs for any errors or warnings?

Hello @simitt, I am sorry, but I could not find anything in the logs from the time of the crash. APM does not keep logs at all, and the Elasticsearch logs are almost empty except for some unrelated entries.

APM does not keep logs at all

Do you mean that you have disabled keeping logs? By default the APM Server does write to log files.
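If logging was explicitly disabled, re-enabling it would make the next crash much easier to diagnose. A minimal sketch of the standard Beats-style logging settings for apm-server.yml (path and rotation values are illustrative):

```yaml
logging.level: info        # switch to debug temporarily while reproducing the issue
logging.to_files: true
logging.files:
  path: /var/log/apm-server
  name: apm-server
  keepfiles: 7             # number of rotated files to retain
```

If the service runs under systemd, also check the journal (journalctl -u apm-server), since stderr output may land there instead of the log files.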

I assume you might be hitting a bug similar to what we have seen in another discuss thread (APM Failed to publish events: temporary bulk send failure / Queue is full 503 error).
From 7.4 on, APM ingest pipelines are enabled by default, and a new field, client.ip, is indexed. Providing invalid data for fields that are part of the pipelines can lead to errors and ingestion retries. This seems to happen in some cases in 7.4 for the client.ip field. There is a bug fix for this, which will be part of the next patch release for 7.4.
Until then I suggest you disable the pipeline and stop ingesting the client.ip field. You can do so by adding the following settings to your apm-server.yml file:

output.elasticsearch.pipeline: "_none"
processors:
  - drop_fields:
      fields: ["client.ip"]
      ignore_missing: true

Hope this solves your issues; apologies for the inconvenience.

Hi,

I can confirm that it works now.

Many thanks,
Tomislav

@simitt Any ETA for that patch? We have the same issue but unfortunately cannot apply that workaround because Elastic Cloud doesn't allow these settings to be set.


+1. Looks like we're in the same boat. I've just been trying to tune the APM server around bursts of "queue is full" errors, and I think it's the same condition.

Is there a workaround for Elastic Cloud customers and/or a date for the fix?
