Random I/O timeout error: `Unable to reach APM Server(): Read timed out`

Kibana version: 7.10.1

Elasticsearch version: 7.10.1

APM Server version: 7.10.1

APM Agent language and version: apm-agent-python 5.10.1

Browser version:

Original install method (e.g. download page, yum, deb, from source, etc.) and version: docker

Fresh install or upgraded from other version?

Is there anything special in your setup?

We deploy apm-server, Kibana, and Elasticsearch on Kubernetes; ingress is handled by nginx-ingress.

apm-server configmap

apm-server.yml:
  host: 0.0.0.0:8200
  max_event_size: 10485760
  rum:
    enabled: true
    allow_origins: ['*']
queue.mem.events: 12287
output.elasticsearch:
  worker: 3
  bulk_max_size: 4096
  hosts: ["elasticsearch:9200"]
logging.level: error
apm-server.ilm: 
  enabled: "auto"
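For reference, roughly how this file is wrapped into a ConfigMap. This is a sketch rather than the exact manifest: the name and namespace match the Deployment further down in this thread, the file body is abbreviated, and the assumption that the settings sit under a top-level `apm-server:` key is mine:

# Sketch of the ConfigMap carrying apm-server.yml (name/namespace taken from
# the Deployment later in this thread; config body abbreviated).
apiVersion: v1
kind: ConfigMap
metadata:
  name: apm-configmap
  namespace: elk
data:
  apm-server.yml: |
    apm-server:
      host: 0.0.0.0:8200
      # ... rest of the configuration shown above ...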

apm-agent-python config

ELASTIC_APM_SERVER_URL="host:80"
ELASTIC_APM_TRANSACTIONS_IGNORE_PATTERNS="'^OPTIONS '"
ELASTIC_APM_CENTRAL_CONFIG="False"
ELASTIC_APM_CAPTURE_BODY="transactions"
ELASTIC_APM_TRANSACTION_SAMPLE_RATE="0.2"
ELASTIC_APM_DJANGO_TRANSACTION_NAME_FROM_ROUTE="True"

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
Every once in a while, all projects that use apm-agent-python report a timeout error, but apm-server looks healthy and I can't find any related error in its logs.

Steps to reproduce:

  1. Restart apm-server (I think this is because apm-server does not shut down gracefully).
  2. The error then appears at random.

Errors in browser console (if relevant):
apm-agent-python reports this error:

TransportException("Unable to reach APM Server: HTTPConnectionPool(host='host', port=80): Read timed out. (read timeout=5) (url: http://host:80/intake/v2/events)")

@Ackerr welcome to the forum!

Steps to reproduce:

  1. Restart apm-server (I think this is because apm-server does not shut down gracefully).
  2. The error then appears at random.

Can you please clarify: does this issue only occur after restarting apm-server? Is it just for a short time, or does it continue happening indefinitely?

And when you say restarting the server, which exact commands are you using? If you can provide a k8s manifest and list the kubectl commands you're running, that would be ideal.

When I restart apm-server, every Python agent reports the error once.

The commands are like this:

kubectl rollout restart deploy/apm -n elk
kubectl rollout status deploy/apm -n elk

Kubernetes deployment manifest (apm-server.yml):

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apm
  namespace: elk
spec:
  selector:
    matchLabels:
      app: apm
  replicas: 4
  revisionHistoryLimit: 0
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    metadata:
      labels:
        app: apm
    spec:
      restartPolicy: Always
      containers:
        - name: apm
          image: docker.elastic.co/apm/apm-server:7.10.1
          ports:
            - containerPort: 8200
          resources:
            limits:
              cpu: 0.2
              memory: 800Mi
            requests:
              cpu: 0.1
              memory: 500Mi
          livenessProbe:
            tcpSocket:
              port: 8200
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              scheme: HTTP
              path: /
              port: 8200
            initialDelaySeconds: 10
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /
              port: 8200
              scheme: HTTP
            periodSeconds: 10
            failureThreshold: 3
          volumeMounts:
            - name: apm-data
              mountPath: /usr/share/apm-server/apm-server.yml
              subPath: apm-server.yml
      volumes:
        - name: apm-data
          configMap:
            name: apm-configmap

The agents still report the error once in a while (every few hours), but apm-server looks normal and the only thing in its logs is this forbidden-request entry:

ERROR	[request]	middleware/log_middleware.go:99	forbidden request	{"request_id": "660bcb35-ab3a-4189-ab12-2760358cbcae", "method": "POST", "URL": "/config/v1/agents", "content_length": 61, "remote_address": "10.0.27.12", "user-agent": "elasticapm-python/5.10.0", "event.duration": 119703, "response_code": 403, "error": "forbidden request: Agent remote configuration is disabled. Configure the `apm-server.kibana` section in apm-server.yml to enable it. If you are using a RUM agent, you also need to configure the `apm-server.rum` section. If you are not using remote configuration, you can safely ignore this error."}
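That 403 is the agents polling the central configuration endpoint, and per the message it can be safely ignored. If remote configuration were wanted instead, the message points at the `apm-server.kibana` section; a minimal sketch, with the Kibana host as a placeholder rather than a real value:

apm-server:
  kibana:
    # Enables agent central configuration ("remote configuration").
    enabled: true
    # Placeholder for an in-cluster Kibana service address.
    host: "kibana:5601"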

I tried adding a lifecycle hook in apm-server.yml (the Deployment manifest above). Now, after a rollout restart of apm-server, no more errors are reported:

lifecycle:
  preStop:
    exec:
      command: [ "sh", "-c", "sleep 5 && kill -s HUP 1" ]
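In context, the hook sits on the apm-server container of the Deployment above; this is a sketch of the change rather than a verbatim diff, with terminationGracePeriodSeconds shown at its Kubernetes default:

spec:
  template:
    spec:
      # Kubernetes default; raise it if the sleep plus shutdown needs longer.
      terminationGracePeriodSeconds: 30
      containers:
        - name: apm
          image: docker.elastic.co/apm/apm-server:7.10.1
          lifecycle:
            preStop:
              exec:
                # Wait a few seconds for in-flight requests to drain, then
                # signal the apm-server process (PID 1) to shut down.
                command: [ "sh", "-c", "sleep 5 && kill -s HUP 1" ]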

The error still appears at random, though; I'm not sure whether it's because apm-server's queue.mem.events is set too large.

    queue.mem.events: 12288
    output.elasticsearch:
      worker: 3
      bulk_max_size: 4096
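For what it's worth, 12288 is exactly worker × bulk_max_size, i.e. one full bulk request per worker; whether that is the right way to size the queue is an assumption on my part:

    # assumed sizing relationship:
    #   queue.mem.events = worker * bulk_max_size = 3 * 4096 = 12288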

I'm not a Kubernetes expert (I'm hoping someone else can chime in on this topic), but if I understand the docs correctly then SIGTERM is sent to the process by default (as I would expect). It's very surprising to me if your change fixes things; APM Server should perform a graceful shutdown on either SIGTERM or SIGHUP. I guess the sleep is somehow helping.

If nobody else responds with an answer, I'll try to reproduce the issue soon.

Thanks for your reply!

To add some context:

Before this error, Sentry often reported another error, `queue is full`. Referring to the documentation, I added this config to the apm-server config to fix it:

    queue.mem.events: 12288
    output.elasticsearch:
      worker: 3
      bulk_max_size: 4096

This config did fix the `queue is full` error, but the `Read timed out` error now gets reported frequently.

Now I am trying to turn down queue.mem.events; I don't know yet whether it helps.
