Kibana version: 8.16.2
Elasticsearch version: 8.16.2
APM Server version: 7.17.25
APM Agent language and version: Java - 1.50.0, 1.43.0 and 1.41.0
Original install method (e.g. download page, yum, deb, from source, etc.) and version: Docker image elastic/apm-server:7.17.25
Fresh install or upgraded from other version? Fresh
Is there anything special in your setup?
- APM servers are running in Kubernetes.
- APM server docker image elastic/apm-server:7.17.25
- APM Server runs with:
  - runAsNonRoot: true
  - seccompProfile: RuntimeDefault
  - readOnlyRootFilesystem: true
  - capabilities: drops ALL capabilities
- APM servers are deployed by ECK operator v2.14.0
- CPU limits set (reproducible even with generous limits)
- APM servers are not managed by Fleet
- Elasticsearch is deployed in Virtual Machines behind a Load Balancer.
Description of the problem including expected versus actual behavior:
For about a month, one of our APM Server instances has been crashing, which restarts the Pod.
The log output shows the usual apm-server activity: ingesting traces from Java agents and flushing them to Elasticsearch. After a while (it can be minutes or hours), the APM Server crashes with the following error message:
fatal error: concurrent map iteration and map write
We found a previous issue about global label sanitisation: "Global label sanitisation may lead to concurrent map modification/access" (Issue #8651, elastic/apm-server). In theory, APM Server 7.17.25 should already contain this fix.
We’ve also observed that the server crashes even without receiving any telemetry (traces) from applications.
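To make the suspected failure mode concrete, here is a minimal Go sketch of the defensive pattern that avoids this class of crash. It is not the apm-server code. It assumes (based only on the stack trace pointing at `sanitizeLabels`) that label keys are rewritten in place on a map that several events share. The names `sanitizeKey` and `sanitizedCopy`, and the set of replaced characters, are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// sanitizeKey replaces characters that are awkward in field names.
// The character set here is an assumption made for this sketch.
func sanitizeKey(k string) string {
	return strings.NewReplacer(".", "_", "*", "_", `"`, "_").Replace(k)
}

// sanitizedCopy builds a new map with sanitized keys and never writes to
// the input. Several goroutines may therefore call it on the same shared
// map at once: concurrent reads of a Go map are safe as long as nobody
// writes to it.
func sanitizedCopy(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[sanitizeKey(k)] = v
	}
	return out
}

func main() {
	// One labels map shared by several "events", as global labels would be.
	shared := map[string]string{"team.env": "demo", "region": "eu"}

	results := make([]map[string]string, 4)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = sanitizedCopy(shared) // read-only access to shared
		}(i)
	}
	wg.Wait()

	fmt.Println(results[0], shared)
}
```

If the loop instead deleted and re-inserted keys in `shared` itself while another goroutine ranged over it, the Go runtime could abort the process with the same "concurrent map iteration and map write" message.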
Steps to reproduce:
- Run APM server 7.17.25 for a few minutes
- Send concurrent intake traffic:
  - From multiple Java services:
    - Sustained transaction and span throughput
    - Multiple concurrent HTTP intake connections
  - NDJSON intake containing:
    - metrics
    - transactions
    - spans
    - context labels (even minimal/static ones)
  - Traffic is continuous, not burst-only.
- Observe crash
After running under load for some time (minutes, not necessarily immediate), APM Server crashes with:
fatal error: concurrent map iteration and map write
with stack trace:
github.com/elastic/apm-server/model.sanitizeLabels
/go/src/github.com/elastic/apm-server/model/labels.go:32
github.com/elastic/apm-server/model.(*APMEvent).BeatEvent
github.com/elastic/apm-server/model.(*Batch).Transform
github.com/elastic/apm-server/publish.(*Publisher).run
Result
APM Server terminates due to a fatal Go runtime error (unlike an ordinary panic, it cannot be recovered) caused by concurrent iteration and modification of a labels map.
Provide logs and/or server output (if relevant):
Log output and stack trace:
fatal error: concurrent map iteration and map write
goroutine 31 [running]:
github.com/elastic/apm-server/model.sanitizeLabels(0xc0023f0450)
/go/src/github.com/elastic/apm-server/model/labels.go:32 +0x74
github.com/elastic/apm-server/model.(*APMEvent).BeatEvent(0xc000c93178, {0xbf10652edb37ce70?, 0xd952ea0a4bb60eb7?})
/go/src/github.com/elastic/apm-server/model/apmevent.go:138 +0x10b0
github.com/elastic/apm-server/model.(*Batch).Transform(0xc0000145d0, {0x55619fac11d0, 0x5561a1ce6d60})
/go/src/github.com/elastic/apm-server/model/batch.go:51 +0x11b
github.com/elastic/apm-server/publish.(*Publisher).run(0xc0009c2460)
/go/src/github.com/elastic/apm-server/publish/pub.go:191 +0x4c
github.com/elastic/apm-server/publish.NewPublisher.func1()
/go/src/github.com/elastic/apm-server/publish/pub.go:118 +0x4d
created by github.com/elastic/apm-server/publish.NewPublisher in goroutine 27
/go/src/github.com/elastic/apm-server/publish/pub.go:116 +0x305
APM Server configuration:
output.elasticsearch:
  hosts:
    - "${ES_HOST1}"
    - "${ES_HOST2}"
    - "${ES_HOST3}"
  username: "${ES_USERNAME}"
  password: "${ES_PASSWORD}"
  protocol: https
  ssl:
    certificate_authorities:
      - /etc/ssl/certs/ca-bundle.crt
apm-server:
  host: 0.0.0.0:8200
  ssl:
    enabled: true
    certificate: "/usr/share/apm-server/config/apm-certs/tls.crt"
    key: "/usr/share/apm-server/config/apm-certs/tls.key"
  auth:
    secret_token: ${APM_SERVER_TOKEN}
    anonymous:
      enabled: true
      allow_agent: [rum-js]
      rate_limit.event_limit: 300
      rate_limit.ip_limit: 1000
  rum:
    enabled: false
    allow_origins: ['*']
  ilm:
    enabled: "auto"
    setup:
      mapping:
        - event_type: "error"
          index_suffix: "team-env"
        - event_type: "span"
          index_suffix: "team-env"
        - event_type: "transaction"
          index_suffix: "team-env"
        - event_type: "metric"
          index_suffix: "team-env"
        - event_type: "profile"
          index_suffix: "team-env"