APM Server 7.17.25 crashes: fatal error: concurrent map iteration and map write

Kibana version: 8.16.2

Elasticsearch version: 8.16.2

APM Server version: 7.17.25

APM Agent language and version: Java - 1.50.0, 1.43.0 and 1.41.0

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Docker image elastic/apm-server

Fresh install or upgraded from other version? Fresh

Is there anything special in your setup?

  • APM Servers are running in Kubernetes.
  • APM Server Docker image: elastic/apm-server:7.17.25
  • APM Server runs with:
    • runAsNonRoot: true
    • seccompProfile: RuntimeDefault
    • readOnlyRootFilesystem: true
    • capabilities: drops ALL capabilities
  • APM Servers are deployed by ECK operator v2.14.0.
  • CPU limits are set (the crash is reproducible even with generous limits).
  • APM Servers are not managed by Fleet.
  • Elasticsearch is deployed on virtual machines behind a load balancer.

Description of the problem including expected versus actual behavior:

About a month ago, one of our APM Server instances started crashing, which causes its Pod to restart.

In the log output we see the usual messages from apm-server ingesting traces from Java agents and flushing them to Elasticsearch. After a while (anywhere from minutes to hours), APM Server crashes with the following error message:

fatal error: concurrent map iteration and map write

We found an earlier issue about global label sanitisation: "Global label sanitisation may lead to concurrent map modification/access" (Issue #8651, elastic/apm-server). In theory, APM Server 7.17.25 should already contain this fix.

We’ve also observed that the server crashes even without receiving any telemetry (traces) from applications.

Steps to reproduce:

  1. Run APM Server 7.17.25 for a few minutes

  2. Send concurrent intake traffic:

    1. From multiple Java services:

      • Sustained transaction and span throughput

      • Multiple concurrent HTTP intake connections

      • NDJSON intake containing:

        • metrics

        • transactions

        • spans

        • context labels (even minimal/static ones)

      Traffic is continuous, not burst-only.

  3. Observe crash

    After running under load for some time (minutes, not necessarily immediate), APM Server crashes with:

    fatal error: concurrent map iteration and map write
    

    with this stack trace (abridged; the full trace is in the log output below):
    github.com/elastic/apm-server/model.sanitizeLabels
    /go/src/github.com/elastic/apm-server/model/labels.go:32
    github.com/elastic/apm-server/model.(*APMEvent).BeatEvent
    github.com/elastic/apm-server/model.(*Batch).Transform
    github.com/elastic/apm-server/publish.(*Publisher).run

Result

APM Server terminates with a fatal Go runtime error caused by concurrent iteration and modification of a labels map. Unlike an ordinary panic, this error cannot be recovered, so the whole process exits.
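
To illustrate the class of bug we suspect, here is a minimal Go sketch. It is not the apm-server code: `sanitizeCopy`, `sanitizeKey`, and the set of replaced characters are assumptions made for illustration. The idea is that a sanitiser which rewrites keys in a map shared between events is unsafe once another goroutine iterates the same map, whereas one that only reads the shared map and writes into a fresh copy can run concurrently without tripping the runtime check.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// keyReplacer maps characters assumed to be reserved in label keys to "_".
// The exact character set here is an assumption for this sketch.
var keyReplacer = strings.NewReplacer(".", "_", "*", "_", `"`, "_")

// sanitizeKey returns the label key with reserved characters replaced.
func sanitizeKey(k string) string {
	return keyReplacer.Replace(k)
}

// sanitizeCopy (hypothetical) only reads the input map and returns a new
// map with sanitised keys. Because the shared map is never written,
// any number of goroutines may call this on the same input at once.
func sanitizeCopy(labels map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(labels))
	for k, v := range labels {
		out[sanitizeKey(k)] = v
	}
	return out
}

func main() {
	// One labels map shared by many "events", as global labels might be.
	shared := map[string]interface{}{"team.name": "payments", "env": "prod"}

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = sanitizeCopy(shared) // concurrent reads only: safe
			}
		}()
	}
	wg.Wait()

	fmt.Println(sanitizeCopy(shared))
}
```

If `sanitizeCopy` instead deleted and re-inserted keys in `shared` itself, the same eight goroutines could hit `fatal error: concurrent map iteration and map write`, which matches the frame at `model/labels.go:32` in the trace.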

Provide logs and/or server output (if relevant):

Log output and stack trace:

fatal error: concurrent map iteration and map write

goroutine 31 [running]:
github.com/elastic/apm-server/model.sanitizeLabels(0xc0023f0450)
	/go/src/github.com/elastic/apm-server/model/labels.go:32 +0x74
github.com/elastic/apm-server/model.(*APMEvent).BeatEvent(0xc000c93178, {0xbf10652edb37ce70?, 0xd952ea0a4bb60eb7?})
	/go/src/github.com/elastic/apm-server/model/apmevent.go:138 +0x10b0
github.com/elastic/apm-server/model.(*Batch).Transform(0xc0000145d0, {0x55619fac11d0, 0x5561a1ce6d60})
	/go/src/github.com/elastic/apm-server/model/batch.go:51 +0x11b
github.com/elastic/apm-server/publish.(*Publisher).run(0xc0009c2460)
	/go/src/github.com/elastic/apm-server/publish/pub.go:191 +0x4c
github.com/elastic/apm-server/publish.NewPublisher.func1()
	/go/src/github.com/elastic/apm-server/publish/pub.go:118 +0x4d
created by github.com/elastic/apm-server/publish.NewPublisher in goroutine 27
	/go/src/github.com/elastic/apm-server/publish/pub.go:116 +0x305

APM Server configuration:

output.elasticsearch:
  hosts: 
      - "${ES_HOST1}"
      - "${ES_HOST2}"
      - "${ES_HOST3}"
  username: "${ES_USERNAME}"
  password: "${ES_PASSWORD}"
  protocol: https
  ssl:
      certificate_authorities:
      - /etc/ssl/certs/ca-bundle.crt
apm-server:
  host: 0.0.0.0:8200
  ssl:
      enabled: true
      certificate: "/usr/share/apm-server/config/apm-certs/tls.crt"
      key: "/usr/share/apm-server/config/apm-certs/tls.key"
  auth:
      secret_token: ${APM_SERVER_TOKEN}
  anonymous:
    enabled: true
    allow_agent: [rum-js]
    rate_limit.event_limit: 300
    rate_limit.ip_limit: 1000
  rum:
      enabled: false
      allow_origins: ['*']
  ilm:
    enabled: "auto"
    setup:
      mapping:
        - event_type: "error"
          index_suffix: "team-env"
        - event_type: "span"
          index_suffix: "team-env"
        - event_type: "transaction"
          index_suffix: "team-env"
        - event_type: "metric"
          index_suffix: "team-env"
        - event_type: "profile"
          index_suffix: "team-env"