APM Failed to publish events: temporary bulk send failure / Queue is full 503 error

Hi,
After upgrade from 7.3 to 7.4 APM-Servers are failing with common error
Failed to publish events: temporary bulk send failure and Queue is full

this is a very busy system and we have 6 APM servers and 5 ELK servers

During peak times we have around 1000-1500/s Total Events Rate per APM server

server config is: 8CPU, 40GB Ram, fast SSD/NVMe storage

i went tonz of time through this "common problem" page:
https://www.elastic.co/guide/en/apm/server/master/common-problems.html#queue-full

please let me know if i am doing something wrong and/or if i can do something to fix this Queue issue on ELK and/or APM server

i was playing with bulk_max_size and workers and queue.mem.events and at the end it always fails

i dont have any problem with logstash servers which are also heavily used

Many thanks
Tomislav

Kibana version: 7.4
Elasticsearch version: 7.4
APM Server version: 7.4
APM Agent language and version: elasticapm-java/1.10.0
Fresh install or upgraded from other version? Upgraded from 7.3

Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.709+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.762+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "54c3149c-15c8-4737-851c-cddbb1876799", "method": "POST", "URL": "/intake/v2/events", "content_length": 2955, "remote_address": "10.10.20.22", "user-agent": "elasticapm-java/1.10.0", "response_code": 503, "error": "queue is full"}
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.813+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "ed1882eb-0ea1-4b8d-985f-6db1bc3e02e7", "method": "POST", "URL": "/intake/v2/events", "content_length": 36128, "remote_address": "10.10.30.23", "user-agent": "elasticapm-java/1.10.0", "response_code": 503, "error": "queue is full"}
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.891+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.922+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:31 elksrv01 apm-server: 2019-10-09T18:16:31.424+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "dc8715ad-d144-43fb-aee3-21f39f4d4872", "method": "POST", "URL": "/intake/v2/events", "content_length": -1, "remote_address": "10.10.20.27", "user-agent": "java-agent/1.6.1", "response_code": 503, "error": "queue is full"}

config file:
apm-server:
host: "10.10.10.11:8200"
idle_timeout: 60s
read_timeout: 45s
max_connections: 0
output.elasticsearch:
hosts:
- 10.10.10.11:9200
- 10.10.10.12:9200
- 10.10.10.13:9200
- 10.10.10.14:9200
- 10.10.10.15:9200
bulk_max_size: 10240
max_retries: 12
worker: 8
username: "elastic"
password: "XXXXXXXXXXXXXXXXXXXXX"

queue.mem.events: 81920

pipeline: "_none"

logging.to_syslog: false
logging.level: error
logging.to_files: true
logging.files:
path: /var/log/apm-server
name: apm-server
keepfiles: 7
permissions: 0644

xpack.monitoring.enabled: true

xpack.monitoring.elasticsearch:

Optional protocol and basic auth credentials.

#protocol: "https"
username: "apm_system"
password: "XXXXXXXXXXXXXXXXXXXX"

Hi there!

Does this started to happen only after the 7.4 upgrade or was like that already before?

Hi Juan,

i can remember that when we started with APM in version 7.1 i think i had similar issue but it was primary related to huge number of request and only one server.. than.. we added additional APM systems and removed unneeded APM systems and it was working with same or even higher throughput last 2, 3 months.. on 7.2 and 7.3

so yes.. i am pretty sure that it started with 7.4 version..

i know that this temporary bulk send failure is common issue which i happening on beats and because of this issue it puzzles me even more.. I have pretty extensive knowledge of designing and sizing systems and to me there is no system metric which would lead me to conclusion that something is undersized on ELK side (we have logstash on two machines which are doing way bigger load on same ELK and i dont see bulk send failure there

on other hand.. maybe i am doing something wrong and maybe APM is not ready for such load (6xAPM servers x 1000-1500 events/s and i need to add additional APM and/or elastic nodes (processing only)

20y of some experiences teach me that with horizontal sizing i might solve problem from user perspective but problem will still be there.. and 1500 events/sec is not such a huge load

load on servers is never bigger than 2 (8 core server CPU E5-2690 v4, IBM svc Tier 1 storage, 40+GB RAM per node)..

is there something i am missing? like some additional statistics or monitoring which would give me info why this problem is happening in first place (queue full) like where problem on elastic node is..

also.. i am reading today posts here and i see new post from today similar to mine with more or less same problem..

i also have this problem that i need to restart APM.. process is up but it is not recovering..

maybe as a shot term solution to add some restart procedure if queue is full

example of state when it dies:

Thanks in advance
tomislav

Hi,
it died today again in 06:36 and in elasticsearch log i can see this message which corelates with time of error in APM

[2019-10-14T06:36:09,862][DEBUG][o.e.a.b.TransportBulkAction] [elksrv01] failed to execute pipeline [apm] for document [apm-7.4.0-transaction/_doc/null]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: 'unknown' is not an IP string literal.
at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:194) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:133) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:100) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.IngestDocument.executePipeline(IngestDocument.java:652) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.PipelineProcessor.execute(PipelineProcessor.java:44) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:123) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:100) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.IngestService.innerExecute(IngestService.java:427) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.IngestService.access$100(IngestService.java:70) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.IngestService$3.doRun(IngestService.java:355) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) [elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.4.0.jar:7.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: 'unknown' is not an IP string literal.
... 15 more
Caused by: java.lang.IllegalArgumentException: 'unknown' is not an IP string literal.
at org.elasticsearch.common.network.InetAddresses.forString(InetAddresses.java:335) ~[elasticsearch-7.4.0.jar:7.4.0]
at org.elasticsearch.ingest.geoip.GeoIpProcessor.execute(GeoIpProcessor.java:114) ~[?:?]
at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:123) ~[elasticsearch-7.4.0.jar:7.4.0]

latest config change:

bulk_max_size: 40960
worker: 64

queue.mem.events: 2621440
queue.mem.flush.min_events: 1000
queue.mem.flush.timeout: 1s

pipeline: "_none"

in apm logs i can see:
Oct 14 06:36:30 elksrv01 apm-server: 2019-10-14T06:36:30.642+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 14 06:36:30 elksrv01 apm-server: 2019-10-14T06:36:30.642+0200#011INFO#011pipeline/output.go:95#011Connecting to backoff(elasticsearch(http://x.x.x.x:9200))

After restart this pipeline error is no longer in logs..
i can recall that i saw this message before when APM stopped..

as you can see from config, i tried to disable this by: pipeline: "_none"

but as you can see it doesn't help

this must be somehow related
tomislav

Hi,

Thanks for the detailed report, your intuition is right and what is probably happening is that apm-server keeps retrying to index those illegal events.
This is a bug that slipped in 7.4.0 and was fixed in https://github.com/elastic/apm-server/pull/2772.

The fix will be released soon in 7.4.1, sorry for the problem!

In the meanwhile, you can try to set output.elasticsearch.max_retries to 0 and see if that alleviates the congestion a bit.

Hi,

many thanks for response and info that it will be solved in 7.4.1

i changed max_retries to 0 but not helping much

anyway.. i will wait for fix
when can we expect that so that i can inform managers and users?

thanks a lot
tomislav

Hi, unfortunately I don't have a date, but it will be released soon.

In case you didn't see it already in the link you posted above, another option is to use a processor to drop that field completely, as my colleague explained here:

Hope that helps and sorry again.

Hi,

just implemented this fix.. lets see

Thanks
Tomislav