Elastic agent unhealthy

We are encountering errors in our current deployment involving Fleet Server and Fleet Agent components. The specific errors we are facing are as follows:

Fleet Server error message: "Non-zero metrics in the last 30s"

Fleet Agent error message: "Cannot check in with fleet-server, retrying"

```
elastic-agent status
┌─ fleet
│  └─ status: (FAILED) status code: 0, fleet-server returned an error: , message: The upstream server is timing out
└─ elastic-agent
   └─ status: (HEALTHY) Running
```

Fleet Server is deployed within our “infrastructure” cluster. This cluster includes Elasticsearch and Kibana components, which are functioning correctly.

Fleet Agent is deployed in one of our Kubernetes “playground” clusters. The purpose of this agent is to collect Kubernetes logs and other observability-related data.

In Kibana the agent is unhealthy/offline (its status flaps from healthy to offline and sometimes back), while Fleet Server stays healthy and online the whole time. Interestingly, even while the agents are periodically marked offline, the agent metrics appear to still be collected.

Additional Information: We need assistance in identifying and resolving these errors to ensure the proper functioning of our deployment. Any guidance or support in addressing these issues would be greatly appreciated. Thank you for your assistance.

```yaml
fleet:
  access_api_key: Y1pHOHgtdw==
  agent:
    id: ac177c50-da37-490b-9ed8-a755be756174
  enabled: true
  host: localhost:5601
  hosts:
    - https://fleet-server.xyz.com:443
  protocol: http
  ssl:
    renegotiation: never
    verification_mode: full
  timeout: 10m0s
```

@abdul90082 We were facing a similar issue, but were getting 504 errors in our agent logs, produced by an Nginx ingress. We found that there is a Fleet Server setting, `checkin_long_poll`, that defaults to 5m, while the Nginx timeouts were at their default of 60s. So whenever a check-in exceeded 60s, we would see the agents flapping and the 504 errors in the logs. Setting `checkin_long_poll` to 60s to match the ingress seems to have resolved the issue for us.
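For reference, a sketch of the two places where these timeouts would be aligned, assuming an ingress-nginx controller in front of Fleet Server and custom YAML settings on the Fleet Server integration policy. All values and annotation choices here are illustrative, not a verified configuration:

```yaml
# Fleet Server integration policy -> custom YAML settings (illustrative):
# keep the long-poll interval at or below the proxy's read timeout.
server.timeouts:
  checkin_long_poll: 60s

---
# Alternatively (or additionally), raise the ingress-nginx timeouts on the
# Fleet Server Ingress above the check-in long-poll; these annotations take
# values in seconds (illustrative):
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "360"
```

Either direction works; the point is simply that the proxy must not cut the long-poll connection before Fleet Server responds, otherwise agents see 504s and flap between healthy and offline.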

It doesn't appear we were having the same issue you are, but hopefully this helps.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.