Elastic agent unhealthy

We are encountering errors in our current deployment involving Fleet Server and Fleet Agent components. The specific errors we are facing are as follows:

Fleet Server error message: "Non-zero metrics in the last 30s"

Fleet Agent error message: "Cannot check in with fleet-server, retrying"

elastic-agent status
┌─ fleet
│  └─ status: (FAILED) status code: 0, fleet-server returned an error: , message: The upstream server is timing out
└─ elastic-agent
   └─ status: (HEALTHY) Running
Environment:

Fleet Server is deployed within our “infrastructure” cluster. This cluster includes Elasticsearch and Kibana components, which are functioning correctly.

Fleet Agent is deployed in one of our Kubernetes “playground” clusters. The purpose of this agent is to collect Kubernetes logs and other observability-related data.

In Kibana the agent shows as unhealthy/offline (its status flaps between healthy and offline and sometimes back), while the Fleet Server stays healthy and online the entire time. Interestingly, even though the Fleet Agents are periodically marked as offline, the agent metrics still appear to be collected.

Additional Information: We need help identifying and resolving these errors so that the deployment functions correctly. Any guidance or support would be greatly appreciated. Thank you for your assistance.

fleet:
  access_api_key: Y1pHOHgtdw==
  agent:
    id: ac177c50-da37-490b-9ed8-a755be756174
  enabled: true
  host: localhost:5601
  hosts:
  - https://fleet-server.xyz.com:443
  protocol: http
  ssl:
    renegotiation: never
    verification_mode: full
  timeout: 10m0s

@abdul90082 We were facing a similar issue, but we were getting 504 errors in our agent logs, produced by an Nginx ingress. We found that the Fleet Server setting checkin_long_poll defaults to 5m, while our Nginx timeouts were left at their default of 60s, so whenever a check-in exceeded 60s the agents would flap and 504 errors would appear in the logs. Setting checkin_long_poll to 60s to match the ingress seems to have resolved the issue for us.
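
In case it is useful, here is a rough sketch of the two sides of that timeout mismatch. The placement of checkin_long_poll below is an assumption based on the standalone fleet-server reference layout, so verify it against your Fleet Server version before applying it.

```yaml
# Fleet Server side (a sketch, assuming a fleet-server.yml-style layout --
# check where your Fleet Server policy accepts custom YAML in your version):
# shorten the long-poll so check-ins complete before the proxy gives up.
inputs:
  - type: fleet-server
    server:
      timeouts:
        checkin_long_poll: 60s   # default is 5m
```

The alternative is to leave checkin_long_poll at its default and instead raise the proxy timeouts on the Ingress that sits in front of fleet-server:

```yaml
# Nginx side: standard ingress-nginx annotations (values in seconds) on the
# Ingress fronting fleet-server; 360s comfortably exceeds the 5m long-poll.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "360"
```

Either way, the key is that the ingress timeout ends up longer than checkin_long_poll; otherwise the proxy cuts the long-poll mid-request and Fleet marks the agent offline until the next successful check-in.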

It doesn't appear we were having the same issue you are, but hopefully this helps.
