Hi,
After upgrading from 7.3 to 7.4, our APM Servers are failing with these common errors:
"Failed to publish events: temporary bulk send failure" and "queue is full".
This is a very busy system: we have 6 APM Servers and 5 ELK servers.
During peak times we see a Total Events Rate of around 1000-1500/s per APM Server.
Each server has 8 CPUs, 40 GB RAM, and fast SSD/NVMe storage.
I have gone through the "common problems" page many times:
https://www.elastic.co/guide/en/apm/server/master/common-problems.html#queue-full
Please let me know if I am doing something wrong, and/or whether there is anything I can do to fix this queue issue on the ELK and/or APM Server side.
I have experimented with bulk_max_size, worker, and queue.mem.events, but in the end it always fails.
I don't have any problems with our Logstash servers, which are also heavily used.
Many thanks
Tomislav
Kibana version: 7.4
Elasticsearch version: 7.4
APM Server version: 7.4
APM Agent language and version: elasticapm-java/1.10.0
Fresh install or upgraded from other version? Upgraded from 7.3
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.709+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.762+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "54c3149c-15c8-4737-851c-cddbb1876799", "method": "POST", "URL": "/intake/v2/events", "content_length": 2955, "remote_address": "10.10.20.22", "user-agent": "elasticapm-java/1.10.0", "response_code": 503, "error": "queue is full"}
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.813+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "ed1882eb-0ea1-4b8d-985f-6db1bc3e02e7", "method": "POST", "URL": "/intake/v2/events", "content_length": 36128, "remote_address": "10.10.30.23", "user-agent": "elasticapm-java/1.10.0", "response_code": 503, "error": "queue is full"}
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.891+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:30 elksrv01 apm-server: 2019-10-09T18:16:30.922+0200#011ERROR#011pipeline/output.go:121#011Failed to publish events: temporary bulk send failure
Oct 9 18:16:31 elksrv01 apm-server: 2019-10-09T18:16:31.424+0200#011ERROR#011[request]#011middleware/log_middleware.go:74#011queue is full#011{"request_id": "dc8715ad-d144-43fb-aee3-21f39f4d4872", "method": "POST", "URL": "/intake/v2/events", "content_length": -1, "remote_address": "10.10.20.27", "user-agent": "java-agent/1.6.1", "response_code": 503, "error": "queue is full"}
config file:
apm-server:
  host: "10.10.10.11:8200"
  idle_timeout: 60s
  read_timeout: 45s
  max_connections: 0
output.elasticsearch:
  hosts:
    - 10.10.10.11:9200
    - 10.10.10.12:9200
    - 10.10.10.13:9200
    - 10.10.10.14:9200
    - 10.10.10.15:9200
  bulk_max_size: 10240
  max_retries: 12
  worker: 8
  username: "elastic"
  password: "XXXXXXXXXXXXXXXXXXXXX"
queue.mem.events: 81920
pipeline: "_none"
logging.to_syslog: false
logging.level: error
logging.to_files: true
logging.files:
  path: /var/log/apm-server
  name: apm-server
  keepfiles: 7
  permissions: 0644
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch:
  # Optional protocol and basic auth credentials.
  #protocol: "https"
  username: "apm_system"
  password: "XXXXXXXXXXXXXXXXXXXX"
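
For reference, here is the rough arithmetic behind the values above, as a small Python sketch. The queue size in the config equals bulk_max_size times worker, and the second function estimates how long the queue could absorb the peak rate if Elasticsearch accepted nothing at all. That fully stalled output is a worst-case assumption of mine, not something measured, and the function names are only illustrative.

```python
# Values copied from the config and the figures in this post.
BULK_MAX_SIZE = 10240        # output.elasticsearch.bulk_max_size
WORKERS = 8                  # output.elasticsearch.worker
QUEUE_EVENTS = 81920         # queue.mem.events
PEAK_EVENTS_PER_SEC = 1500   # upper end of the per-server peak rate


def max_in_flight(bulk_max_size: int, workers: int) -> int:
    """Events in flight at once if every worker sends one full bulk request."""
    return bulk_max_size * workers


def seconds_until_full(queue_events: int, events_per_sec: float) -> float:
    """Time for an empty queue to fill if the output drains nothing (worst case)."""
    return queue_events / events_per_sec


if __name__ == "__main__":
    in_flight = max_in_flight(BULK_MAX_SIZE, WORKERS)
    buffer_s = seconds_until_full(QUEUE_EVENTS, PEAK_EVENTS_PER_SEC)
    print(f"max in-flight events: {in_flight}")
    print(f"queue covers in-flight events: {QUEUE_EVENTS >= in_flight}")
    print(f"seconds of buffer at peak rate: {buffer_s:.1f}")
```

So on paper the queue matches what the workers can send and gives under a minute of headroom at peak, which is why I suspect the bottleneck is on the Elasticsearch side rather than in these three settings.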