APM Server 7.17.25 crashes: fatal error: concurrent map iteration and map write

Kibana version: 8.16.2

Elasticsearch version: 8.16.2

APM Server version: 7.17.25

APM Agent language and version: Java - 1.50.0, 1.43.0 and 1.41.0

Original install method (e.g. download page, yum, deb, from source, etc.) and version: docker image elastic/apm-server:7.17.25

Fresh install or upgraded from other version? Fresh

Is there anything special in your setup?

  • APM servers are running in Kubernetes.
  • APM server docker image elastic/apm-server:7.17.25
  • APM Server runs with:
    • runAsNonRoot: true
    • seccompProfile: RuntimeDefault
    • readOnlyRootFilesystem: true
    • capabilities: drops ALL capabilities
  • APM servers are deployed by the ECK operator v2.14.0
  • CPU limits set (reproducible even with generous limits)
  • APM servers are not managed by Fleet
  • Elasticsearch is deployed in Virtual Machines behind a Load Balancer.

Description of the problem including expected versus actual behavior:

About a month ago, one instance of our APM servers started to crash, causing its Pod to restart.

In the log output we can see the usual info from apm-server ingesting traces from the Java agents and flushing them to Elasticsearch. After a while (it can be minutes or hours), the APM server crashes with the following error message:

fatal error: concurrent map iteration and map write

We’ve checked that there is a previous issue regarding global label sanitisation (Global label sanitisation may lead to concurrent map modification/access · Issue #8651 · elastic/apm-server · GitHub); in theory, APM server 7.17.25 should already contain this fix.

We’ve also observed that the server crashes even without receiving any telemetry (traces) from applications.
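
For context, this message is a Go runtime throw rather than an ordinary panic: the runtime aborts when one goroutine iterates over a map while another goroutine writes to it, and recover() cannot catch it, which is why the process dies outright. A minimal standalone sketch (unrelated to the apm-server code) that reproduces the same message:

package main

// Minimal repro of the fatal error in this report: one goroutine
// iterates over a map while another writes to it.
func main() {
	labels := map[string]string{"env": "prod", "team": "payments"}

	// Writer: keeps mutating the shared map.
	go func() {
		for {
			labels["counter"] = "x"
		}
	}()

	// Reader: iterates the same map concurrently. The runtime
	// detects the overlap and aborts with
	// "fatal error: concurrent map iteration and map write".
	for {
		for k, v := range labels {
			_, _ = k, v
		}
	}
}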

Steps to reproduce:

  1. Run APM server 7.17.25 for a few minutes

  2. Send concurrent intake traffic from multiple Java services:
    • Sustained transaction and span throughput
    • Multiple concurrent HTTP intake connections
    • NDJSON intake containing:
      • metrics
      • transactions
      • spans
      • context labels (even minimal/static ones)
    Traffic is continuous, not burst-only.

  3. Observe crash

    After running under load for some time (minutes, not necessarily immediate), APM Server crashes with:

    fatal error: concurrent map iteration and map write
    

    with this stack trace:
    github.com/elastic/apm-server/model.sanitizeLabels
    /go/src/github.com/elastic/apm-server/model/labels.go:32
    github.com/elastic/apm-server/model.(*APMEvent).BeatEvent
    github.com/elastic/apm-server/model.(*Batch).Transform
    github.com/elastic/apm-server/publish.(*Publisher).run

Result

APM Server terminates due to a Go runtime fatal error caused by concurrent iteration and modification of a labels map.
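
For reference, the usual fix for this class of bug is to stop mutating a map that may be shared between goroutines and to sanitise into a fresh copy instead. A rough sketch of that pattern with hypothetical names, not the actual apm-server source (label keys disallow '.', '*' and '"', which get replaced with underscores):

package main

import (
	"fmt"
	"strings"
)

// sanitizeLabelsCopy is a hypothetical sketch of the copy-on-write
// pattern, not apm-server's real code: sanitise into a fresh map
// instead of mutating the (possibly shared) input in place, so two
// publisher goroutines can no longer race on the same labels map.
func sanitizeLabelsCopy(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		// Replace disallowed characters in label keys.
		k = strings.Map(func(r rune) rune {
			switch r {
			case '.', '*', '"':
				return '_'
			}
			return r
		}, k)
		out[k] = v
	}
	return out
}

func main() {
	fmt.Println(sanitizeLabelsCopy(map[string]string{"k8s.pod.name": "apm-0"}))
	// Output: map[k8s_pod_name:apm-0]
}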

Provide logs and/or server output (if relevant):

Log output and stack trace:

fatal error: concurrent map iteration and map write

goroutine 31 [running]:
github.com/elastic/apm-server/model.sanitizeLabels(0xc0023f0450)
	/go/src/github.com/elastic/apm-server/model/labels.go:32 +0x74
github.com/elastic/apm-server/model.(*APMEvent).BeatEvent(0xc000c93178, {0xbf10652edb37ce70?, 0xd952ea0a4bb60eb7?})
	/go/src/github.com/elastic/apm-server/model/apmevent.go:138 +0x10b0
github.com/elastic/apm-server/model.(*Batch).Transform(0xc0000145d0, {0x55619fac11d0, 0x5561a1ce6d60})
	/go/src/github.com/elastic/apm-server/model/batch.go:51 +0x11b
github.com/elastic/apm-server/publish.(*Publisher).run(0xc0009c2460)
	/go/src/github.com/elastic/apm-server/publish/pub.go:191 +0x4c
github.com/elastic/apm-server/publish.NewPublisher.func1()
	/go/src/github.com/elastic/apm-server/publish/pub.go:118 +0x4d
created by github.com/elastic/apm-server/publish.NewPublisher in goroutine 27
	/go/src/github.com/elastic/apm-server/publish/pub.go:116 +0x305
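
Note that the fatal error above only shows the iterating goroutine (goroutine 31), not the one doing the conflicting write. Go's race detector reports both stacks; a hypothetical regression test along these lines (names made up), run with go test -race, should flag the access pattern:

package model_test

import (
	"sync"
	"testing"
)

// Hypothetical test: sanitise the same labels map from two
// goroutines. Under -race the detector prints both conflicting
// stacks, unlike the runtime fatal error, which shows only one.
func TestSanitizeLabelsConcurrently(t *testing.T) {
	labels := map[string]string{"some.key": "v", "other.key": "v"}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Stand-in for a sanitisation pass over a shared map.
			for k := range labels {
				labels[k] = "sanitised"
			}
		}()
	}
	wg.Wait()
}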

APM Server configuration:

output.elasticsearch:
  hosts: 
      - "${ES_HOST1}"
      - "${ES_HOST2}"
      - "${ES_HOST3}"
  username: "${ES_USERNAME}"
  password: "${ES_PASSWORD}"
  protocol: https
  ssl:
      certificate_authorities:
      - /etc/ssl/certs/ca-bundle.crt
apm-server:
  host: 0.0.0.0:8200
  ssl:
      enabled: true
      certificate: "/usr/share/apm-server/config/apm-certs/tls.crt"
      key: "/usr/share/apm-server/config/apm-certs/tls.key"
  auth:
      secret_token: ${APM_SERVER_TOKEN}
  anonymous:
    enabled: true
    allow_agent: [rum-js]
    rate_limit.event_limit: 300
    rate_limit.ip_limit: 1000
  rum:
      enabled: false
      allow_origins: ['*']
  ilm:
    enabled: "auto"
    setup:
      mapping:
        - event_type: "error"
          index_suffix: "team-env"
        - event_type: "span"
          index_suffix: "team-env"
        - event_type: "transaction"
          index_suffix: "team-env"
        - event_type: "metric"
          index_suffix: "team-env"
        - event_type: "profile"
          index_suffix: "team-env"


Hi @jpescalona ,

Thanks for the report. I appreciate that you’ve already checked for any existing issues.

Considering 7.17 is EOL (see Elastic Product End of Life Dates | Elastic), would you mind upgrading APM Server to match your Elasticsearch version (8.16.2 in this case) and seeing if the problem persists?

Hi @Carson_Ip, thanks for replying. We actually have both versions running, but regrettably we still have some customers on really old Java agent versions, and upgrading them is not feasible at this moment.

We found a mitigation that avoids APM server 7 continuously crashing, or at least reduces the recurrence of these crashes: putting an Nginx proxy between the application APM agents and APM server 7. This reduced the number of crashes to almost none.

My assumption is that the Nginx proxy slows down trace ingestion, making the concurrent map iteration issue less likely to reproduce by reducing the concurrency of ingesting traces or metrics with large payloads and many metadata entries.
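
To make that assumed mechanism concrete: if the proxy caps the number of in-flight intake requests, fewer publisher goroutines touch shared label maps at the same time. A minimal Go sketch of the same idea, a concurrency-limited reverse proxy with made-up addresses and limits (not our actual Nginx config):

package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream address for the APM Server 7 service;
	// our real setup terminates TLS and uses different names.
	upstream, err := url.Parse("http://apm-server:8200")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Semaphore capping in-flight intake requests, roughly the
	// effect of Nginx worker/connection limits.
	sem := make(chan struct{}, 16)

	log.Fatal(http.ListenAndServe(":8201", http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			sem <- struct{}{}        // take a slot
			defer func() { <-sem }() // free it when done
			proxy.ServeHTTP(w, r)
		})))
}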

Thanks for letting me know that you have found a workaround for it. I can confirm that this specific piece of code is not present in 8.x versions, but unfortunately I cannot reproduce your issue in 7.17.25 in isolation.