I have a problem with my APM Server.
Kibana version: 7.6.2
Elasticsearch version: 7.6.2
APM Server version: 7.6.2
Original install method (e.g. download page, yum, deb, from source, etc.) and version: download page
I'm running a stress test with Locust, sending 600 requests per second to a cluster of 6 APM Server nodes.
Each request contains more than one event: transactions and spans.
I got them by intercepting some requests from an APM Agent.
APM Server receives those requests and sends them to an ES cluster, directly to the 3 hot data nodes.
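For reference, the load generator replays the captured events roughly like this (a minimal sketch of the intake v2 framing; the URL, metadata, and event contents are illustrative, not the actual captured payloads):

```python
import json
import urllib.request

# Assumption: one APM Server node behind this address
APM_URL = "http://localhost:8200/intake/v2/events"

def build_ndjson(metadata, events):
    """Frame one metadata object plus a list of event objects as intake v2 NDJSON:
    first line is {"metadata": ...}, each following line is one event."""
    lines = [json.dumps({"metadata": metadata})]
    lines += [json.dumps(event) for event in events]
    return ("\n".join(lines) + "\n").encode("utf-8")

def send(body, url=APM_URL):
    # The stress test used python-requests; plain urllib works the same way.
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    return urllib.request.urlopen(req)  # raises HTTPError on a 503
```

Each Locust task would call `send(build_ndjson(...))` with one metadata line followed by the intercepted transaction and span events.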
This is my apm-server.yml:

```yaml
apm-server:
  host: "0.0.0.0:8200"
  rum:
    enabled: true
  kibana:
    enabled: true
    host: "monitoring-kibana-00"

queue:
  mem:
    events: 368640
    flush.min_events: 2048
    flush.timeout: 1s

#-------------------------- Elasticsearch output --------------------------
output.elasticsearch:
  hosts: ["monitoring-es-hot-00", "monitoring-es-hot-01", "monitoring-es-hot-02"]
  indices:
    - index: "apm-%{[observer.version]}-sourcemap"
      when.contains:
        processor.event: "sourcemap"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "transaction"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "span"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "error"
    - index: "apm-%{[observer.version]}-metric-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "metric"
    - index: "apm-%{[observer.version]}-onboarding-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "onboarding"
  bulk_max_size: 5120
  worker: 36
```
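Incidentally, here is a back-of-the-envelope check on these numbers (assuming `worker` is counted per configured host, as in the Beats Elasticsearch output that APM Server reuses; that assumption may be wrong for my setup):

```python
# Values taken from the apm-server.yml above
queue_events = 368640      # queue.mem.events
bulk_max_size = 5120       # events per bulk request
workers_per_host = 36      # output.elasticsearch.worker (per host, assumed)
es_hosts = 3               # hot nodes in output.elasticsearch.hosts

total_workers = workers_per_host * es_hosts    # 108 concurrent bulk senders
max_in_flight = total_workers * bulk_max_size  # 552960 events

# More events can be in flight at once than the queue can hold:
print(max_in_flight > queue_events)  # True: 552960 > 368640
```

So under that assumption the workers can drain faster than the queue can buffer, and the queue only fills when Elasticsearch stops acknowledging bulks quickly enough.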
When I start the test, it runs well up to approximately 600 rps. After a while, I start getting 503 errors.
I see this in my logs:
```
2020-05-14T11:15:05.102-0400 ERROR pipeline/output.go:121 Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.219-0400 ERROR pipeline/output.go:121 Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.337-0400 ERROR pipeline/output.go:121 Failed to publish events: temporary bulk send failure
2020-05-14T11:15:07.005-0400 ERROR [request] middleware/log_middleware.go:95 queue is full {"request_id": "a1229fbd-cb1e-42a2-9733-b1909cfad1a2", "method": "POST", "URL": "/intake/v2/events", "content_length": 4242, "remote_address": "10.194.40.232", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.005-0400 ERROR [request] middleware/log_middleware.go:95 queue is full {"request_id": "72ff2067-1132-4e2c-9582-1ca9d0d20249", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 3188, "remote_address": "10.194.16.223", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.020-0400 ERROR [request] middleware/log_middleware.go:95 queue is full {"request_id": "be709fed-8d17-497a-813c-fc58c5d8152c", "method": "POST", "URL": "/intake/v2/events", "content_length": 2044, "remote_address": "10.194.49.55", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
```
However, the APM Server and ES hosts look very quiet (low CPU and load).
I have tried different configurations for APM Server (more workers, a larger queue.mem.events, and so on), but I cannot find what the problem is.
Can someone guide me to understand this, please?