503 return codes in APM

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

We've started getting 503 return codes in our APM application. The application is written in Node.js. Here are the error messages we're seeing:

```
Elastic APM HTTP error (503): queue is full: queue is full
Elastic APM HTTP error (503): timeout waiting to be processed
```

These errors occur about 20 times over the course of a day.

Kibana version: 6.6.1

Elasticsearch version: 6.6.1

APM Server version: 6.6.1

APM Agent language and version: Node.js, 2.7.0

Browser version: Chrome 73.x

Original install method (e.g. download page, yum, deb, from source, etc.) and version: YUM

Fresh install or upgraded from other version? Fresh install

What's the best way to narrow down the source of the 503 return codes? I do not see the exact error message in the apm-server source.

```
apm-server(master): pt 'queue is full'
5:            "message": "queue is full"
129:			err:     errors.Wrap(err, "queue is full"),
5:            "message": "queue is full"
61:For example: queue is full, IP rate limit reached, wrong metadata, etc.
81:      "message": "queue is full" <3>
58:	ErrFull          = errors.New("queue is full")
119:// Send tries to forward pendingReq to the publishers worker. If the queue is full,
1743:	{106, "EQFULL", "interface output queue is full"},
```


Hi @mikemadden42

It sounds like an error from the APM Server, correct?

Have you looked at the Common Problems section in the docs? There's a section about how to troubleshoot 503 errors.


Thanks @wa7son. I did look over the Common Problems document. It implies that if we only receive 503 return codes, an Elasticsearch disk may be full. We are indeed only receiving 503s when the error occurs, but I've verified we have plenty of free disk space on our Elastic Stack cluster.

It sounds a little like the APM Server can't keep up with the amount of data that's being sent to it. Do you know approximately how many HTTP requests and events it's receiving?

If that's the issue, the recommended solution is to spin up multiple APM Servers behind a load balancer.
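If throughput does turn out to be the bottleneck, a minimal sketch of fronting two APM Servers with a load balancer might look like the following (nginx shown, but HAProxy or a cloud load balancer works equally well; all hostnames and certificate paths here are hypothetical placeholders, not from this thread):

```nginx
# Hypothetical nginx sketch: two APM Servers behind a single intake endpoint.
upstream apm_servers {
    least_conn;                       # route new connections to the least busy server
    server apm01.example.com:8200;
    server apm02.example.com:8200;
}

server {
    listen 8200 ssl;
    ssl_certificate     /etc/pki/tls/certs/apm-lb.crt;
    ssl_certificate_key /etc/pki/tls/private/apm-lb.key;

    location / {
        proxy_pass http://apm_servers;
        proxy_set_header Host $host;
    }
}
```

With TLS terminated at the balancer, the agents would point their `serverUrl` at the balancer instead of an individual APM Server.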

Hi @wa7son, I'm leaning towards the same conclusion. Perhaps our single APM Server cannot keep up with the requests. In our production cluster, we've received right at 56 million events over the past 24 hours, and almost 240 million over the past week.
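For context, those totals work out to a nontrivial sustained rate (a back-of-the-envelope sketch, assuming events are spread evenly over the period; real traffic will have higher peaks):

```python
# Rough sustained event rates from the totals above.
daily_events = 56_000_000
weekly_events = 240_000_000

per_sec_day = daily_events / 86_400            # 86,400 seconds in a day
per_sec_week = weekly_events / (7 * 86_400)

print(f"~{per_sec_day:.0f} events/s over the last 24 hours")   # ~648
print(f"~{per_sec_week:.0f} events/s averaged over the week")  # ~397
```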

Do you think it's worth tuning our existing APM server? We've left it pretty default.

```
[root@apmsrv ~]# egrep -v '^[[:blank:]]*#|^$' /etc/apm-server/apm-server.yml
  host: "apmsrv.somedomain.com:8200"
    enabled: false
  ssl.enabled: true
  ssl.certificate : "/etc/pki/tls/certs/apmsrv.crt"
  ssl.key : "/etc/pki/tls/private/apmsrv.key.pem"
  index.number_of_shards: 2
  index.codec: best_compression
  host: "https://kibana.somedomain.com:5601"
  hosts: ["ingest01.somedomain.com:9200", "ingest02.somedomain.com:9200"]
  protocol: "https"
  username: "elastic"
  password: ${elastic_pass}
apm-server.rum.enabled: true
apm-server.rum.rate_limit: 10
apm-server.rum.allow_origins: ['*']
apm-server.rum.library_pattern: "node_modules|bower_components|~"
apm-server.rum.exclude_from_grouping: "^/webpack"
apm-server.rum.source_mapping.cache.expiration: 5m
apm-server.rum.source_mapping.index_pattern: "apm-*-sourcemap*"
```

The defaults are rather conservative. It's almost certainly worth tuning your APM Server settings, depending on the hardware you're running it on.

You can directly increase the size of the queue that's filling up via `queue.mem.events`. You'll also want to adjust the number of output workers accordingly. The APM Server tuning documentation describes many other knobs you can adjust for your workload.
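As a starting point, the relevant knobs in `apm-server.yml` look something like this (the values below are illustrative only, not recommendations; defaults and limits vary by version and hardware):

```yaml
# Illustrative sizing only; tune against your own hardware and monitoring data.
queue.mem.events: 8192            # internal event queue; the default is conservative
output.elasticsearch:
  worker: 4                       # parallel bulk writers to Elasticsearch
  bulk_max_size: 5120             # events per bulk request
```

Watch queue fill and Elasticsearch bulk rejections after each change rather than raising everything at once.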

I'd also suggest enabling monitoring if possible to help guide some of these changes.
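On 6.x, enabling monitoring is a one-line addition to `apm-server.yml` (this ships metrics to the configured output cluster by default; a dedicated monitoring cluster is also possible):

```yaml
xpack.monitoring.enabled: true
```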

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.