ECK Fleet Server behind Ingress - Elastic Agents becoming unhealthy

Hi All,

Brief Facts:

We have deployed ECK in Azure AKS, with the whole stack behind an Ingress as shown in the diagram below. The requirement is to connect Elastic Agents residing outside the ECK cluster to the Fleet Servers residing inside the cluster. Agents may come from the internal corporate network or connect over the Internet, so an Ingress has been set up to load balance between the Fleet Servers. In the Ingress we have configured three backend services (one each for Kibana, Elasticsearch, and Fleet Server).

We have no problem connecting to Kibana and Elasticsearch through the Ingress.

Issue Currently Being Faced:

The issue we are facing is that when any Elastic Agent outside the cluster tries to connect to the Fleet Server through the Ingress, the agent gets successfully enrolled but then turns unhealthy.

What we found in the local agent's logs is that after the agent is enrolled in the Fleet Server (the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck is used during enrollment), the Fleet Server actually returns its internal URL - https://fleet-server-eck-agent-http.namespace.svc:8220/api/status - in the response to the Elastic Agent. This is the Fleet Server's Kubernetes service URL, which the external Elastic Agent has no means to resolve.

The exact error is :

{"log.level":"error","@timestamp":"2022-08-26T09:30:13.406Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":211},"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get \"https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?\": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}

Different Options Tried

  • Added the Ingress URL to the Kibana config xpack.fleet.agents.fleet_server.hosts along with the Fleet Server's service URL, i.e.:

         - https://xxxx.mydomain.com:443/fleetserver-eck
         - https://fleet-server-eck-agent-http.namespace.svc:8220
    
  • Used --proxy-url with the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck when starting the Elastic Agent

None of the above options helped.

Note: when we curl https://xxxx.mydomain.com:443/fleetserver-eck/api/status, it shows a healthy status.
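For completeness, the external agents are enrolled against the Ingress URL with a command of roughly this shape (the domain, token, and CA path below are placeholders rather than our real values):

```shell
# Enroll a standalone Elastic Agent (outside the cluster) via the Ingress.
# Enrollment itself succeeds with this URL; the agent only turns
# unhealthy afterwards, once Fleet hands back the internal service URL.
sudo elastic-agent enroll \
  --url=https://xxxx.mydomain.com:443/fleetserver-eck \
  --enrollment-token=<redacted-enrollment-token> \
  --certificate-authorities=/etc/elastic-agent/ca.crt
```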

Elastic Agent Configuration

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-ums
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  fleetServerRef:
    name: fleet-server-eck
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent-serviceaccount
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
Fleet Server Configuration
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-eck
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  elasticsearchRefs:
    - name: elasticsearch-eck
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 2
    podTemplate:
      spec:
        serviceAccountName: fleet-server-serviceaccount
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
Kibana Config
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-eck
  namespace: observability
spec:
  version: 8.4.0
  count: 2
  elasticsearchRef:
    name: elasticsearch-eck
  config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://elasticsearch-eck-es-http.observability.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://fleet-server-eck-agent-http.observability.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
      - name: kubernetes
        # pinning this version as the next one introduced a kube-proxy host setting default that breaks this recipe,
        # see https://github.com/elastic/integrations/pull/1565 for more details
        version: 0.14.0
      - name: apm
        version: latest

    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        is_default_fleet_server: true
        package_policies:
          - name: fleet_server-1
            id: fleet_server-1
            package:
              name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        is_default: true
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
          - name: kubernetes-1
            id: kubernetes-1
            package:
              name: kubernetes
          - name: apm-1
            id: apm-1
            package:
              name: apm
            inputs:
              - type: apm
                enabled: true
                vars:
                  - name: host
                    value: 0.0.0.0:8200

We have been stuck with this issue for many days now, and any help is much appreciated. Please let us know if there is any additional configuration we are currently missing, and also whether what we are trying to achieve is even supported yet. Thanks

I believe your problem here is in the configuration of your URLs.

In your Kibana configuration you're setting:

config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://elasticsearch-eck-es-http.observability.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://fleet-server-eck-agent-http.observability.svc:8220"]

These two settings are what Fleet tells the Elastic Agents to point to. Since the URLs you're setting are local to the Kubernetes cluster, they will really only work from inside the cluster.

I believe you will need to change these URLs to the public-facing (Ingress) URLs. Once you change them, Fleet will tell all Elastic Agents to connect to the public URLs.

That should allow Elastic Agents outside your cluster to work. It will also allow Elastic Agents inside your cluster to work (provided they have the DNS servers/entries needed to resolve the Ingress URLs).
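To make that concrete, the Kibana spec would look something like this (xxxx.mydomain.com and the path prefixes stand in for your real Ingress endpoints):

```yaml
config:
  xpack.fleet.agents.elasticsearch.hosts:
    ["https://xxxx.mydomain.com:443/elasticsearch-eck"]
  xpack.fleet.agents.fleet_server.hosts:
    ["https://xxxx.mydomain.com:443/fleetserver-eck"]
```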

@BenB196 Thanks for your suggestion. I have already tried options like the ones below:

config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://elasticsearch-eck-es-http.observability.svc:9200","https://xxxx.mydomain.com:443/elasticsearch-eck" ]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://fleet-server-eck-agent-http.observability.svc:8220","https://xxxx.mydomain.com:443/fleetserver-eck"]

also the option:

config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://xxxx.mydomain.com:443/elasticsearch-eck"]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://xxxx.mydomain.com:443/fleetserver-eck"]

With the first option, the Elastic Agent receives the following response:

"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get \"https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?\": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}

With the second option, the agent gets enrolled and becomes healthy, but then hits the following error, after which there is no activity and no further logs:

{"log.level":"error","@timestamp":"2022-08-31T04:59:15.943Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":293},"message":"Could not communicate with fleet-server Checking API will retry, error: could not decode the response, raw response: <html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n","ecs.version":"1.6.0"}

I don't understand why there is no clear documentation of this use case, where Fleet is used behind a load balancer or Ingress. The page "Fleet Server deployment models" in the Fleet and Elastic Agent Guide [8.4] describes different deployment models, one of which is Fleet Server behind a load balancer, but I can't find any detailed documentation on it.

I also tried the --proxy-url option with the Ingress URL, but that too returned a bad request.

I don't understand why this has to be so difficult to set up. Kibana is easily accessible behind an Ingress, so why isn't this? A few people have posted about facing similar situations, and I haven't found a suitable reply for them yet.

Dear Elastic developers/members, could you please help with this and provide clear guidance on the issue? Any help is much appreciated.

Hi @sanju_techie, the second option you tried should've worked, interesting. The error message you got is from Nginx itself:

<html>\r\n<head><title>504 Gateway Time-out</title></head>\r\n<body>\r\n<center><h1>504 Gateway Time-out</h1></center>\r\n<hr><center>nginx</center>

Would you be able to provide the Ingress configuration of Nginx, the Kubernetes service configuration, and the redacted command you're using to install/enroll your agents?

Could you also look at the Nginx logs to see what request is resulting in the 504 error? This would probably help with debugging.
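One thing worth checking while you're looking at the Nginx side: Fleet Server holds agent check-in requests open as long-polls, which can outlast ingress-nginx's default 60-second proxy timeouts and surface as exactly this kind of 504. If the logs point that way, raising the timeouts on the Fleet Server Ingress with annotations along these lines (the values are illustrative) may help:

```yaml
metadata:
  annotations:
    # Fleet Server long-polls agent check-ins; raise the proxy timeouts
    # above the ingress-nginx default of 60s so the connection isn't cut
    # mid-poll with a 504.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```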

@BenB196

Thanks for the suggestion. I have made it work now, connecting both external and internal agents to the cluster. It was not straightforward; a lot of workarounds were needed. The issue is that when we provide the Ingress URLs as the hosts, those become the output the agents connect to. The internal agents then have no means to connect to the Ingress URL, so we had to specify the Fleet and Elasticsearch URLs explicitly in environment variables of the container, for both the Fleet-enabled agent and the plain agent.

Also, an xpack.fleet.outputs section has to be added to the Kibana YAML, providing the CA certificate of the Elasticsearch host. The same CA has to be provided, as an absolute path, to the internal Elastic Agent and the Fleet agent.

With a Basic license, Elasticsearch restricts adding separate outputs to separate policies, which would otherwise let us keep one output for internal agents and one for external agents. But we had to make it work anyway, since we are Basic license users.
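For anyone hitting the same wall, the rough shape of our workaround was as follows; the output name, hosts, and CA paths below are placeholders, not our exact manifests. In the Kibana spec, a preconfigured Fleet output points the agents at the Ingress URL and names the CA file the agents should trust (the path must exist on the agent side):

```yaml
config:
  xpack.fleet.outputs:
    - id: eck-output
      name: eck-output
      type: elasticsearch
      is_default: true
      hosts: ["https://xxxx.mydomain.com:443/elasticsearch-eck"]
      ssl:
        certificate_authorities: ["/mnt/elastic/es-ca.crt"]
```

And on the in-cluster agents, the containers are pointed back at the internal Fleet Server via environment variables (FLEET_URL and FLEET_CA are standard Elastic Agent container variables):

```yaml
daemonSet:
  podTemplate:
    spec:
      containers:
        - name: agent
          env:
            - name: FLEET_URL
              value: "https://fleet-server-eck-agent-http.observability.svc:8220"
            - name: FLEET_CA
              value: "/mnt/elastic/fleet-ca.crt"
```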

Thanks for the hints and suggestion. Really appreciated.