OpenTelemetry agent unable to connect to APM Server over secure connection

Kibana version:
7.13.4
Elasticsearch version:
7.13.4
APM Server version:
7.13.4
APM Agent language and version:
NA
Browser version:
NA
Original install method (e.g. download page, yum, deb, from source, etc.) and version:
Helm
Fresh install or upgraded from other version?
Fresh
Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs? Are you using a load balancer in front of the APM Servers? Have you changed index pattern, generated custom templates, changed agent configuration etc.
NA
Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
We set api_key.enabled: true in APM Server and created an API key for the OpenTelemetry agents, but the agents are unable to connect to APM Server over a secure TLS connection. APM Server itself is running fine and is connected to Elasticsearch over TLS.

Steps to reproduce:

  1. Install Otel agent

Errors in browser console (if relevant):
NA
Provide logs and/or server output (if relevant):
Below is the YAML configuration of the OTel agent DaemonSet:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-agent-conf
  namespace: es
  labels:
    app: opentelemetry
    component: otel-agent-conf
data:
  otel-agent-config: |
    receivers:
      hostmetrics:
        collection_interval: 10s
        scrapers:
          cpu:
          load:
          memory:
      otlp:
        protocols:
          grpc:
          http:
      jaeger:
        protocols:
          grpc:
          thrift_compact:
          thrift_http:
      zipkin:
    exporters:
      otlp/elastic:
        endpoint: "https://apm-server.es.svc.cluster.local:8200"
        headers:
          # Elastic APM Server API key
          Authorization: "ApiKey ${ELASTIC_APM_SERVER_APIKEY}"
      logging:
        loglevel: WARN
    processors:
      batch:
      memory_limiter:
        # Same as --mem-ballast-size-mib CLI argument
        ballast_size_mib: 165
        # 80% of maximum memory up to 2G
        limit_mib: 400
        # 25% of limit up to 2G
        spike_limit_mib: 100
        check_interval: 5s      
    service:
      pipelines:
        metrics:
          receivers: [otlp, hostmetrics]
          processors: [batch]
          exporters: [otlp/elastic, logging]
        traces:
          receivers: [otlp, jaeger, zipkin]
          processors: [memory_limiter, batch]
          exporters: [otlp/elastic, logging]
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: es
  labels:
    app: opentelemetry
    component: otel-agent
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-agent
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-agent
    spec:
      containers:
      - command:
          - "/otelcol"
          - "--config=/conf/otel-agent-config.yaml"
          # Memory Ballast size should be max 1/3 to 1/2 of memory.
          # - "--mem-ballast-size-mib=165"          
        image: 470776511283.dkr.ecr.ap-south-1.amazonaws.com/dev-reco-otel:latest
        name: otel-agent
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 100Mi
        ports:
        - containerPort: 6831 # Jaeger Thrift Compact
          protocol: UDP
        - containerPort: 8888 # Prometheus Metrics
        - containerPort: 9411 # Default endpoint for Zipkin receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 4317 # Default OpenTelemetry gRPC receiver port.
        - containerPort: 55681 # Default OpenTelemetry HTTP receiver port.
        env:
           # Get pod ip so that k8s_tagger can tag resources
          - name: POD_IP
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
            # This is picked up by the resource detector
          - name: OTEL_RESOURCE_ATTRIBUTES
            value: "k8s.pod.ip=$(POD_IP)"
          - name: ELASTIC_APM_SERVER_APIKEY
            valueFrom:
              secretKeyRef:
                name: elastic-apm-server-key
                key: encryptionkey
        volumeMounts:
        - name: otel-agent-config-vol
          mountPath: /conf
        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
      volumes:
        - configMap:
            name: otel-agent-conf
            items:
              - key: otel-agent-config
                path: otel-agent-config.yaml
          name: otel-agent-config-vol

Below is the error from one of the OTel agent pods:

2021-08-01T12:08:26.594Z        info    service/collector.go:211        Everything is ready. Begin running and processing data.
2021-08-01T12:08:36.610Z        info    exporterhelper/queued_retry.go:325      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "name": "otlp/elastic", "error": "failed to push metrics data via OTLP exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"", "interval": "5.52330144s"}
2021-08-01T12:08:46.632Z        info    exporterhelper/queued_retry.go:325      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "name": "otlp/elastic", "error": "failed to push metrics data via OTLP exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: tls: first record does not look like a TLS handshake\"", "interval": "5.822800266s"}
2021-08-01T12:08:48.416Z        info    service/collector.go:225        Received signal from OS {"signal": "terminated"}
2021-08-01T12:08:48.416Z        info    service/collector.go:331        Starting shutdown...

Please help us get the connection working.

@axw Could you please have a look at this request? I believe you can help me out, as you did with my previous issue.

Thanks
Nitin G

Can you please share your apm-server config? The log message suggests that APM Server is not configured to communicate with agents over TLS.

Thanks for replying @axw

Below is my APM config

apmConfig:
  apm-server.yml: |
    apm-server:
      host: "0.0.0.0:8200"
      api_key.enabled: true 
      api_key.limit: 50
    queue: {}
    output.elasticsearch:
      hosts: ["https://es-client.es.svc.cluster.local:9200"]
      username: "${ELASTICSEARCH_USERNAME}"
      password: "${ELASTICSEARCH_PASSWORD}"
      protocol: https
      ssl.enabled: true
      ssl.key: /usr/share/apm-server/config/certs/tls.key
      ssl.certificate: /usr/share/apm-server/config/certs/tls.crt
      ssl.certificate_authorities: /usr/share/apm-server/config/certs/ca.crt

Thanks. So what you have here is apm-server communicating with Elasticsearch over TLS, but not with agents. For the latter, you also need to configure apm-server.ssl. Please refer to "SSL/TLS communication" in the APM Server Reference [7.13].

If you don't need TLS between the agents and server, then you can set insecure: true in your opentelemetry-collector exporter config.
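
As a rough illustration of that second option, a plaintext exporter could look like the fragment below. This is only a sketch: it assumes a collector release from around the time of this thread, where insecure is a top-level exporter setting (newer releases nest TLS settings under a tls block), and it writes the endpoint in host:port form, which is an assumption and not something taken from this thread.

```yaml
exporters:
  otlp/elastic:
    # host:port form is an assumption; the thread uses an https:// URL
    endpoint: "apm-server.es.svc.cluster.local:8200"
    # Turns off TLS on the exporter side. Only suitable when APM Server
    # itself listens without apm-server.ssl enabled.
    insecure: true
    headers:
      # Elastic APM Server API key
      Authorization: "ApiKey ${ELASTIC_APM_SERVER_APIKEY}"
```

With this setting the API key travels unencrypted inside the cluster, so it is a trade-off and not a fix for the TLS setup.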

Hey @axw

First, I generated a certificate for APM Server as below:

---
# Certificate issuer
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: ca-issuer
  namespace: es
spec:
  selfSigned: {}
--- 
# APM Server certificate and secret with tls.crt and tls.key
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: apm-cert
  namespace: es
spec:
  secretName: apm-cert
  subject:
    organizations:
    - vinculum
  isCA: true
  privateKey:
    algorithm: RSA
    encoding: PKCS1
    size: 2048
  usages:
    - server auth
    - client auth
  dnsNames:
  - localhost
  - 127.0.0.1
  - apm-server.es.svc
  - apm-server.es.svc.cluster.local
  issuerRef:
    name: ca-issuer
    kind: Issuer
---

Then I modified the apm_config as below

apmConfig:
  apm-server.yml: |
    apm-server:
      host: "0.0.0.0:8200"
      api_key.enabled: true 
      api_key.limit: 50
      ssl.enabled: true
      ssl.key: /usr/share/apm-server/config/certs/apm/tls.key
      ssl.certificate: /usr/share/apm-server/config/certs/apm/tls.crt
    queue: {}
    output.elasticsearch:
      hosts: ["https://es-client.es.svc.cluster.local:9200"]
      username: "${ELASTICSEARCH_USERNAME}"
      password: "${ELASTICSEARCH_PASSWORD}"
      protocol: https
      ssl.enabled: true
      ssl.key: /usr/share/apm-server/config/certs/es/tls.key
      ssl.certificate: /usr/share/apm-server/config/certs/es/tls.crt
      ssl.certificate_authorities: /usr/share/apm-server/config/certs/es/ca.crt

Then I modified the liveness/readiness probes to run the health check over HTTPS.
Then I installed APM Server with Helm and it worked fine.
Then I regenerated the authentication API key and created a new secret from it to use in the OTel agent config. But the log error changed to the following:

2021-08-03T05:43:41.061Z        info    service/collector.go:211        Everything is ready. Begin running and processing data.
2021-08-03T05:43:51.084Z        info    exporterhelper/queued_retry.go:325      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "name": "otlp/elastic", "error": "failed to push metrics data via OTLP exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate signed by unknown authority\"", "interval": "5.52330144s"}
2021-08-03T05:44:01.102Z        info    exporterhelper/queued_retry.go:325      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "name": "otlp/elastic", "error": "failed to push metrics data via OTLP exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: authentication handshake failed: x509: certificate signed by unknown authority\"", "interval": "5.822800266s"}

I generated the Kibana and Elasticsearch certificates in a similar way, but neither APM Server nor Kibana throws an error like this.

APM Server has been configured with the custom certificate authorities:

apmConfig:
  apm-server.yml: |
    ...
    output.elasticsearch:
      ...
      ssl.certificate_authorities: /usr/share/apm-server/config/certs/es/ca.crt

This is why APM Server does not have any such errors. You will need to configure the opentelemetry-collector exporter similarly. According to the README.md in the open-telemetry/opentelemetry-collector GitHub repository, you should set ca_file in the otlp exporter config.
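
For ca_file to work, the CA file also has to exist inside the otel-agent container. One possible way to get it there, sketched below, is to mount the ca.crt key of the apm-cert secret written by the cert-manager Certificate shown earlier. The volume name apm-ca and the mount path /certs/apm are hypothetical, and the sketch assumes the secret actually carries a ca.crt key.

```yaml
# DaemonSet fragment (sketch); only the parts relevant to the CA mount are shown
spec:
  template:
    spec:
      containers:
        - name: otel-agent
          volumeMounts:
            - name: apm-ca              # hypothetical volume name
              mountPath: /certs/apm     # hypothetical mount path
              readOnly: true
      volumes:
        - name: apm-ca
          secret:
            secretName: apm-cert        # secret created by the Certificate above
            items:
              - key: ca.crt             # assumes the secret contains this key
                path: ca.crt
```

With a mount like this, the exporter setting would be ca_file: /certs/apm/ca.crt. Mounting only ca.crt keeps the server's private key out of the agent pods.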

@axw that's really cool advice. I managed to update the TLS settings in the OTel agent config as below, and it worked like a charm.

    exporters:
      otlp/elastic:
        endpoint: "https://apm-server.es.svc.cluster.local:8200"
        headers:
          # Elastic APM Server API key
          Authorization: "ApiKey ${ELASTIC_APM_SERVER_APIKEY}"
        ca_file: /usr/share/apm-server/config/certs/ca.crt
        cert_file: /usr/share/apm-server/config/certs/tls.crt
        key_file: /usr/share/apm-server/config/certs/tls.key

But a few minutes later the DaemonSet failed with the status below:

  Warning  Unhealthy  3s (x5 over 43s)   kubelet            Liveness probe failed: Get "https://192.168.178.233:13133/": dial tcp 192.168.178.233:13133: connect: connection refused
  Warning  Unhealthy  2s (x5 over 42s)   kubelet            Readiness probe failed: Get "https://192.168.178.233:13133/": dial tcp 192.168.178.233:13133: connect: connection refused

I am still looking for a solution. My health check config is shown below; I tried adding scheme: HTTPS, but no luck.

        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
            scheme: HTTPS
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
            scheme: HTTPS

Also, one thing still confuses me a lot, and I see that you have answered a similar query somewhere in the GitHub issues. It goes like this:

What changes should we make to get a sensible name under the APM section? Also, there are no metrics visible inside this Unknown module. Please suggest.

Hey @axw I was able to fix the probe failures by adding the health_check extension to the OTel agent config:

    extensions:
      health_check: {}
    service:
      extensions:
        - health_check

I also edited the probes as below:

        livenessProbe:
          httpGet:
            path: /
            # host: 0.0.0.0
            port: 13133 # Health Check extension default port.
            # scheme: HTTP
        readinessProbe:
          httpGet:
            path: /
            # host: 0.0.0.0
            port: 13133 # Health Check extension default port.
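
Putting the pieces from this thread together, a trimmed-down collector config might look like the sketch below. It is not the exact config in use here: the receivers and pipelines are cut down to a single trace pipeline for brevity, and only the CA file is passed to the exporter. The health_check extension serves plain HTTP on port 13133 by default, which is why the probes work without scheme: HTTPS.

```yaml
extensions:
  # Listens on port 13133 over plain HTTP by default
  health_check: {}
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  otlp/elastic:
    endpoint: "https://apm-server.es.svc.cluster.local:8200"
    headers:
      # Elastic APM Server API key
      Authorization: "ApiKey ${ELASTIC_APM_SERVER_APIKEY}"
    # CA that signed the APM Server certificate
    ca_file: /usr/share/apm-server/config/certs/ca.crt
service:
  # Without this entry the extension is defined but never started,
  # so nothing answers the kubelet probes on 13133
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/elastic]
```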

I will create a separate thread for the APM stuff.
