Elastic Agents disappearing

I'm on 7.13.4 using fleet server and my agents have disappeared overnight. I've tried restarting the entire cluster (in elastic cloud) and the agents all still show as offline in fleet.

The cluster health is all good too. When I try to enroll a new agent I get:

Error: fail to enroll: fail to execute request to fleet-server: status code: 0, fleet-server returned an error: , message: Unknown resource.

Any ideas on how I can get my agents back online please?

Is the Hosted Elastic Agent showing healthy in the UI? Did you change anything in the Elasticsearch or fleet-server URL in the Fleet UI?

The agents show up in Security > hosts but not in the Fleet > Agents screen.

No, I haven't changed anything in the Fleet UI :slight_smile: I've observed this in two different (Elastic Cloud) clusters over the past few weeks.

Elastic Agent is our primary / only source of data, so this is a big issue for us and is preventing me from moving other sites across to Fleet.

It will be really disappointing if fleet doesn't make it to GA as we've invested considerable time to it!

Sorry, I should have been more specific. In the Fleet UI there are at least 2 policies: one for the Hosted Elastic Agent and a Default one. The Hosted Elastic Agent run by Elastic Cloud is enrolled into the "Elastic Cloud agent policy" policy. If you go to the list of Agents in the Fleet UI and filter by this policy, do you see any Elastic Agent? If the answer is no, my next question is whether you could share the content of this policy in YAML (remove the confidential part). There is an issue with migration we have seen recently which puts an empty array (input: []) in this policy, which breaks it. Hope this moves us a step closer to the solution.
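For anyone wanting to check an exported policy for the empty-array problem described above, here is a minimal sketch in Python. It assumes the policy has already been loaded into a plain dict (for example from the YAML shown in the Fleet UI), and that the empty list sits at the top level under a key named `input` or `inputs`. The function name and the sample dicts are hypothetical and only for illustration.

```python
from typing import Any


def find_empty_inputs(policy: dict[str, Any]) -> list[str]:
    """Return the top-level input keys whose value is an empty list."""
    problems = []
    # Assumption: the broken migration leaves `input: []` or `inputs: []`
    # at the top level of the policy document.
    for key in ("input", "inputs"):
        if key in policy and policy[key] == []:
            problems.append(key)
    return problems


# Hypothetical, heavily trimmed policies; real policies carry many more fields.
broken = {"id": "policy-elastic-agent-on-cloud", "revision": 2, "inputs": []}
healthy = {
    "id": "policy-elastic-agent-on-cloud",
    "revision": 2,
    "inputs": [{"type": "fleet-server", "name": "Fleet Server"}],
}

print(find_empty_inputs(broken))
print(find_empty_inputs(healthy))
```

A non-empty result suggests the policy matches the migration issue; an empty result only means this particular symptom is absent.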

Thanks @ruflin - the agent is enrolled with that policy but it has been offline for 48h.

here is my policy :

id: policy-elastic-agent-on-cloud
revision: 2
    type: elasticsearch
      - >-
        - monitor
        - names:
            - logs-*
            - metrics-*
            - traces-*
            - .logs-endpoint.diagnostic.collection-*
            - synthetics-*
            - auto_configure
            - create_doc
    enabled: false
    logs: false
    metrics: false
  - id: 4feec4e9-2c0c-468e-8541-2966896dc125
    name: Fleet Server
    revision: 1
    type: fleet-server
    use_output: default
        name: fleet_server
        version: 0.9.1
      namespace: default
      port: 8220
      limits.max_connections: 200
      num_counters: 2000
      max_cost: 2097152
      policy_throttle: 200ms
        interval: 50ms
        burst: 25
        max: 100
        interval: 100ms
        burst: 10
        max: 10
        interval: 10ms
        burst: 20
        max: 20
        interval: 100ms
        burst: 5
        max: 10
      gc_percent: 20
    - >-

The policy looks fine. I was worried that the input part was missing, but it is there. Any chance you could send me a PM with the cluster id so I could have a look at the logs?

Thanks for the cluster id. I have now finally had a look at it. Based on the logs, I assume you have upgraded to 7.14 by now? I see some logs about the Elasticsearch cluster being overloaded, but I guess that happened during the migration? It would also not explain why Elastic Agents were dropped. Did the 7.14 upgrade solve the issue?

How many Elastic Agents do you have enrolled? There was a memory leak in 7.13 that is fixed now but maybe that caused an issue if you had many Elastic Agents?

There is one more error I see in the logs that worries me a bit around invalid API keys. Did you by chance play around with the API Keys in Elasticsearch directly?

Thanks @ruflin - yeah I upgraded to 7.14 to see if that would fix the problem but alas it hasn't.

I've written a powershell script to install Elastic Agent and because I have been re-creating clusters so often I used a custom / predictable name and replaced the fleet endpoint url in the enroll url so that I can have a predictable endpoint URL (not sure it is that helpful in the end as enrollment tokens need to match)

I only had 30 something agents so there should be no capacity issues (IMHO) but the agents are really flaky at the moment across the board. Since 7.14 even the agents on linux servers have had issues / become unhealthy for no apparent reason...

We have found an issue on the Beats side (Fleet: policy aren't assigned to agents (flaky) · Issue #27299 · elastic/beats · GitHub) that affects some of the clusters but at the moment I think you are running into a different problem.

Can you tell me more about this replaced fleet endpoint? Is this like a proxy? Have you changed the fleet-server url in the Fleet UI?

BTW I see many "failed checkin" logs from the fleet-server, which are likely explained by it checking in against a wrong local url. Now I want to know even more about your fleet endpoint config.

This is the script GitHub - hilt86/installElasticAgent: Powershell script to deploy Elastic Agent

I've been setting a "Custom endpoint alias" in the Elastic Cloud portal and copying the url from the fleet section in :

Do you need any more info @ruflin ?

In the Fleet UI under Settings you have a line with fleet-server hosts. Did you make any modifications there?

no modifications & yep the correct fleet url is in there :slight_smile:

Sorry to pick on this again. Can you share the exact fleet-url that is there (with some obfuscation)? Did you use the alias as the fleet-server url or the one with the deployment id?

I've tried both with the same results - it is currently using : https://somethingsomething.fleet.eastus2.azure.elastic-cloud.com:443
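A quick way to tell whether a given Fleet Server URL (alias or deployment id) is reachable at all is to query its status endpoint. The sketch below is in Python and assumes Fleet Server exposes `/api/status` on the same host and port as the enrollment URL and returns JSON; the helper names are hypothetical, and the hostname is the obfuscated one from this thread, so the network call is left commented out.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def status_url(fleet_url: str) -> str:
    """Build the status URL from a Fleet Server URL, dropping any path or query."""
    parts = urlsplit(fleet_url)
    # Assumption: the status endpoint lives at /api/status on the same host:port.
    return urlunsplit((parts.scheme, parts.netloc, "/api/status", "", ""))


def fleet_server_status(fleet_url: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the status document (raises on connection or HTTP errors)."""
    with urllib.request.urlopen(status_url(fleet_url), timeout=timeout) as resp:
        return json.load(resp)


fleet_url = "https://somethingsomething.fleet.eastus2.azure.elastic-cloud.com:443"
print(status_url(fleet_url))

# Uncomment with a real URL to perform the request:
# print(fleet_server_status(fleet_url))
```

Running it once against the alias URL and once against the deployment-id URL shows whether both resolve to a responding Fleet Server, which helps separate URL problems from enrollment token problems.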

With the linked powershell script I can increment the version number and it will re-install with the newer details.

Also is there supposed to be an 'Elastic Agent' service? On one of the endpoints this service is nowhere to be found, only the 'Elastic Endpoint' service exists..

I'm still trying to figure this one out. Every few minutes I see a failed checkin in the fleet-server logs. Unfortunately, the more detailed logs are only at the debug level. Let's try the following:

  • Go to the Agent list page
  • Select the hosted Elastic Agent
  • Go to the Logs tab
  • Switch the "Agent logging level" to "debug".

I'm hoping this gives us some more details on why the checkin fails.

@blaker I could use your help on this one. Any further ideas?
@hilt86 For the fleet-server url, let's keep the one with the deployment id in the settings.

Yep keeping the fleet-server url as the one with the deployment id.

I don't actually see a hosted Elastic agent....that can't be good :

How do I get it back so I can enable logging?

This is definitely not good. Could you go to the Cloud console, remove the APM Fleet slider and add it again? This retriggers the setup. Did the Agent disappear after the 7.14 upgrade?