APM Server timing out under heavy load

Category:
-> TCP Traffic / Dimensioning.
Problem:
Nodes are not responding TCP duriing peak hours (self recovery, affects inbound traffic thought)

Kibana version:
(not used for config)

Elasticsearch version:
Currently writing to logstash: 7.5.1

APM Server version:
7.5.1

APM Agent language and version:
(several). java agents mainly, plus latest RUM agents.

Browser version:
(several).

Original install method (e.g. download page, yum, deb, from source, etc.) and version:

rpm -qa |grep apm

apm-server-7.5.1-1.x86_64

Fresh install or upgraded from other version?
(upgraded)

Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs?

load balancer -> APM-Servers -> Logstash -> Elasticsearch

Load Balancer infront of APM server Farm (8 nodes). Load Balancer does a tCP-half open health check that fails during peak hours.

Icinga monitoring checks on TCP status of port 8200 of each node. During Peak hours we have both, TCP timeouts on apm servers and large delays on tcp responses from all of the farm members (see tcp stats below)

APM servers write to Logstash nodes directly

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

Current Behaviour; during peak load hour, I get TCP Timeouts on port 8200, losing and unknown amount of RUM and APM traffic.
I do not get any indicators of the actual problem as CPU user time remains below 20%, memory footprint is the same, regardless of the traffic, TCP timeouts are not logged.

Expected behaviour: (a) Information on if there is further fine tuning possible, as There are No CPU / Memory issues on nodes (see graphs belows), and apm limits have been tuned as per apm recomendations

There are no memory or CPU overloads on the APM Farm during peak hours. APM Traffic exists, but lots(some ??) of it is not received at server due to TCP time outs

Steps to reproduce:

  1. Heavy application load causes it on my infrastructure. No issues on low load conditions.

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):
Logs from peak hours reflecting errors (one node only):

root@xxx# cat apm-server.log | sed "s/.*response_code// " | cut -d: -f2 | sed "s/}//" | sort | uniq -c
875 200
44697 202
55 401, "error"
106 403, "error"
4 429, "error"
1 500, "error"
root@xxx# cat apm-server.log.1 | sed "s/.*response_code// " | cut -d: -f2 | sed "s/}//" | sort | uniq -c
2498 200
127482 202
143 401, "error"
245 403, "error"
9 429, "error"
5 500, "error"

401 and 403 errors have been checked, there are configuration issues on some of the agent (no keys, no autoconfig from kibana enabled) but expected.
429 errors have been trobleshooted by expanding limits of event rate on node configuation:

#---------------------------- APM Server - RUM Real User Monitoring ----------------------------

Enable Real User Monitoring (RUM) Support. By default RUM is disabled.

rum:
enabled: true

event_rate:
  limit: 1000
  lru_size: 8000

allow_origins: [ '...

sample 500 Errors:

apm-server.log:2020-10-05T04:47:07.947-0400 ERROR [request] middleware/log_middleware.go:74 internal error {"request_id": "9a79369b-7989-45be-924c-793c95e53e31", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 11974, "remote_address": "xxx.xxx.xxx.xxx", "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_7 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.2 Mobile/15E148 Safari/604.1", "response_code": 500, "error": "read tcp xxx.xxx.xxx.xxx:8200->xxx.xxx.xxx.xxx:35968: read: connection reset by peer"}

2020-10-05T04:02:33.922-0400 ERROR [request] middleware/log_middleware.go:74 internal error {"request_id": "1653857a-3855-40e5-821c-542fbb9e3325", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 8196, "remote_address": "xxx.xxx.xxx.xxx", "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1", "response_code": 500, "error": "read tcp xxx.xxx.xxx.xxx:8200->xxx.xxx.xxx.xxx:19925: read: connection reset by peer"}
2020-10-05T04:24:43.255-0400 ERROR [request] middleware/log_middleware.go:74 internal error {"request_id": "5ee51918-1cef-4618-b585-8b5da34482a1", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 3726, "remote_address": "xxx.xxx.xxx.xxx", "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15", "response_code": 500, "error": "read tcp xxx.xxx.xxx.xxx:8200->xxx.xxx.xxx.xxx:6896: i/o timeout"}
2020-10-05T04:32:24.261-0400 ERROR [request] middleware/log_middleware.go:74 internal error {"request_id": "8e8a62d5-7a8e-453d-b846-001fcd692116", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 5608, "remote_address": "xxx.xxx.xxx.xxx", "user-agent": "Mozilla/5.0 (Linux; Android 9) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/87.0.4263.2 Mobile DuckDuckGo/5 Safari/537.36", "response_code": 500, "error": "context canceled"}

Can you tell if there is a limit on how many TCP connections can a single node handle, as to adjust dimensionting of the farm? Are threre any default process limits that might need overriding?

Each node is running 4 cures, but CPU load and CPU user time points to releasing 2 cores in favour of expanding size fam.

Node Limits

ulimit -a

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 15066
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited