Elastic Agents disappearing

hilt86 · August 2, 2021, 2:54am

I'm on 7.13.4 using fleet server and my agents have disappeared overnight. I've tried restarting the entire cluster (in elastic cloud) and the agents all still show as offline in fleet.

The cluster health is all good too. When I try to enroll a new agent I get:

Error: fail to enroll: fail to execute request to fleet-server: status code: 0, fleet-server returned an error: , message: Unknown resource.

Any ideas on how I can get my agents back online please?

ruflin · August 2, 2021, 1:01pm

Is the Hosted Elastic Agent showing healthy in the UI? Did you change anything in the Elasticsearch or fleet-server URL in the Fleet UI?

hilt86 · August 3, 2021, 12:21am

The agents show up in Security > hosts but not in the Fleet > Agents screen.

No, I haven't changed anything in the Fleet UI. I've observed this in two different (Elastic Cloud) clusters over the past few weeks.

Elastic Agent is our primary / only source of data, so this is a big issue for us and is preventing me from moving other sites across to Fleet.

It will be really disappointing if Fleet doesn't make it to GA, as we've invested considerable time in it!

ruflin · August 3, 2021, 9:25am

Sorry, I should have been more specific. In the Fleet UI there are at least 2 policies: one for the Hosted Elastic Agent and a Default one. The Hosted Elastic Agent run by Elastic Cloud is enrolled into the "Elastic Cloud agent policy" policy. If you go to the list of Agents in the Fleet UI and filter by this policy, do you see any Elastic Agent? If the answer is no, my next question is whether you could share the content of this policy in YAML (remove the confidential parts). There is a migration issue we have seen recently that puts an empty inputs array (inputs: []) into this policy, which breaks it. Hope this moves us a step closer to the solution.

hilt86 · August 4, 2021, 12:51am

Thanks @ruflin - the agent is enrolled with that policy but it has been offline for 48h.

here is my policy :

id: policy-elastic-agent-on-cloud
revision: 2
outputs:
  default:
    type: elasticsearch
    hosts:
      - >-
        https://some.eastus2.azure.elastic-cloud.com:443
output_permissions:
  default:
    _fallback:
      cluster:
        - monitor
      indices:
        - names:
            - logs-*
            - metrics-*
            - traces-*
            - .logs-endpoint.diagnostic.collection-*
            - synthetics-*
          privileges:
            - auto_configure
            - create_doc
agent:
  monitoring:
    enabled: false
    logs: false
    metrics: false
inputs:
  - id: 4feec4e9-2c0c-468e-8541-2966896dc125
    name: Fleet Server
    revision: 1
    type: fleet-server
    use_output: default
    meta:
      package:
        name: fleet_server
        version: 0.9.1
    data_stream:
      namespace: default
    server:
      port: 8220
      host: 0.0.0.0
      limits.max_connections: 200
    cache:
      num_counters: 2000
      max_cost: 2097152
    server.limits:
      policy_throttle: 200ms
      checkin_limit:
        interval: 50ms
        burst: 25
        max: 100
      artifact_limit:
        interval: 100ms
        burst: 10
        max: 10
      ack_limit:
        interval: 10ms
        burst: 20
        max: 20
      enroll_limit:
        interval: 100ms
        burst: 5
        max: 10
    server.runtime:
      gc_percent: 20
fleet:
  hosts:
    - >-
      https://else.fleet.eastus2.azure.elastic-cloud.com:443
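The check ruflin describes (a policy whose inputs list was emptied by the migration) can be sketched in a few lines. This is only an illustration, not a Fleet tool: it assumes the policy has already been parsed into a Python dict (for example with a YAML library), and the function name `find_policy_problems` and the exact set of checks are made up for this example.

```python
def find_policy_problems(policy):
    """Return a list of human-readable problems found in a parsed agent policy.

    Hypothetical helper: the checks mirror what is discussed in this thread,
    not any official Fleet validation.
    """
    problems = []
    inputs = policy.get("inputs")
    if not inputs:
        # Covers both a missing key and the broken "inputs: []" case.
        problems.append("policy has no inputs (missing or empty 'inputs' list)")
    elif not any(item.get("type") == "fleet-server" for item in inputs):
        problems.append("no input of type 'fleet-server' found")
    if not policy.get("fleet", {}).get("hosts"):
        problems.append("no fleet hosts configured")
    return problems


# Trimmed-down version of the policy posted above, written as a dict.
policy = {
    "id": "policy-elastic-agent-on-cloud",
    "inputs": [{"name": "Fleet Server", "type": "fleet-server"}],
    "fleet": {"hosts": ["https://else.fleet.eastus2.azure.elastic-cloud.com:443"]},
}

print(find_policy_problems(policy))                     # policy as posted
print(find_policy_problems({**policy, "inputs": []}))   # the broken migration case
```

For the policy as posted the list of problems comes back empty, which matches the "policy looks fine" assessment; the second call shows what the empty-inputs case would report.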

ruflin · August 4, 2021, 11:24am

The policy looks fine. I was worried that the inputs part was missing, but it is there. Any chance you could send me a PM with the cluster id so I could have a look at the logs?

ruflin · August 9, 2021, 7:41am

Thanks for the cluster id. I finally had a look at it. Based on the logs I assume you have upgraded to 7.14 by now? I see some logs about the Elasticsearch cluster being overloaded, but I guess that happened during the migration? It would also not explain why Elastic Agents were dropped. Did the 7.14 upgrade solve the issue?

How many Elastic Agents do you have enrolled? There was a memory leak in 7.13 that is fixed now but maybe that caused an issue if you had many Elastic Agents?

There is one more error I see in the logs that worries me a bit around invalid API keys. Did you by chance play around with the API Keys in Elasticsearch directly?

hilt86 · August 10, 2021, 12:37am

Thanks @ruflin - yeah I upgraded to 7.14 to see if that would fix the problem but alas it hasn't.

I've written a PowerShell script to install Elastic Agent. Because I have been re-creating clusters so often, I used a custom / predictable name and replaced the fleet endpoint URL in the enroll URL, so that I can have a predictable endpoint URL (not sure it is that helpful in the end, as enrollment tokens need to match).

I only had 30 something agents so there should be no capacity issues (IMHO) but the agents are really flaky at the moment across the board. Since 7.14 even the agents on linux servers have had issues / become unhealthy for no apparent reason...

ruflin · August 11, 2021, 11:40am

We have found an issue on the Beats side (Fleet: policy aren't assigned to agents (flaky) · Issue #27299 · elastic/beats · GitHub) that affects some of the clusters but at the moment I think you are running into a different problem.

Can you tell me more about this replaced fleet endpoint? Is this like a proxy? Have you changed the fleet-server url in the Fleet UI?

ruflin · August 11, 2021, 11:48am

BTW I see many "failed checkin" logs from the fleet-server, which are likely explained by it checking in against a wrong local URL. Now I want to know even more about your fleet endpoint config.
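One way to rule out the endpoint itself is to ask the Fleet Server for its status directly. The sketch below is hedged: it assumes the Fleet Server exposes a status endpoint at `/api/status` that returns JSON with a `status` field such as `"HEALTHY"`. Treat that path and response shape as assumptions and verify them against the documentation for your version. The helper names are made up for this example.

```python
import json
import urllib.request


def status_url(fleet_host):
    """Build the status URL for a Fleet Server host (path is an assumption)."""
    return fleet_host.rstrip("/") + "/api/status"


def is_healthy(body):
    """Interpret a raw JSON response body.

    Assumed shape: {"name": "fleet-server", "status": "HEALTHY"}.
    Anything that does not parse or does not say HEALTHY counts as unhealthy.
    """
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and str(data.get("status", "")).upper() == "HEALTHY"


def check(fleet_host, timeout=10):
    """Fetch the status endpoint and report whether it looks healthy."""
    with urllib.request.urlopen(status_url(fleet_host), timeout=timeout) as resp:
        return is_healthy(resp.read().decode("utf-8"))
```

Calling `check(...)` with the URL shown under Fleet > Settings, from a machine where an agent fails to check in, would show whether that URL reaches a Fleet Server at all from that network.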

hilt86 · August 11, 2021, 9:43pm

This is the script: hilt86/installElasticAgent on GitHub (PowerShell script to deploy Elastic Agent)

I've been setting a "Custom endpoint alias" in the Elastic Cloud portal and copying the url from the fleet section in :

hilt86 · August 12, 2021, 9:19pm

Do you need any more info @ruflin ?

ruflin · August 13, 2021, 7:07am

In the Fleet UI under Settings you have a line with fleet-server hosts. Did you make any modifications there?

hilt86 · August 16, 2021, 4:00am

no modifications & yep the correct fleet url is in there

ruflin · August 16, 2021, 1:31pm

Sorry to pick on this again. Can you share the exact fleet URL that is there (with some obfuscation)? Did you use the alias as the fleet-server URL or the one with the deployment id?

hilt86 · August 17, 2021, 3:45am

I've tried both with the same results - it is currently using : https://somethingsomething.fleet.eastus2.azure.elastic-cloud.com:443
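Since the question here is whether the URL in the Fleet settings and the URL the install script enrolls against are really the same endpoint, a small comparison helper can take the guesswork out. This is a sketch under assumptions: it only compares scheme, host and port (treating a missing port as the scheme default), and the alias hostname in the example is hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"https": 443, "http": 80}


def normalize(url):
    """Reduce a URL to (scheme, host, port) for comparison."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # An omitted port means the scheme's default port.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return scheme, host, port


def same_endpoint(a, b):
    """True if both URLs point at the same scheme, host and port."""
    return normalize(a) == normalize(b)


settings_url = "https://somethingsomething.fleet.eastus2.azure.elastic-cloud.com:443"
script_url = "https://SomethingSomething.fleet.eastus2.azure.elastic-cloud.com"
alias_url = "https://my-alias.fleet.eastus2.azure.elastic-cloud.com:443"  # hypothetical alias

print(same_endpoint(settings_url, script_url))  # differs only in case and explicit port
print(same_endpoint(settings_url, alias_url))   # different host label
```

A mismatch between the two does not by itself prove the checkins fail for that reason, but it narrows down whether the agents and the Fleet settings agree on where the Fleet Server lives.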

With the linked powershell script I can increment the version number and it will re-install with the newer details.

hilt86 · August 17, 2021, 3:47am

Also is there supposed to be an 'Elastic Agent' service? On one of the endpoints this service is nowhere to be found, only the 'Elastic Endpoint' service exists..

ruflin · August 17, 2021, 11:52am

I'm still trying to figure this one out. Every few minutes I keep seeing failed checkins in the fleet-server logs. Unfortunately, the more detailed logs are at the debug level. Let's try the following:

Go to the Agent list page
Select the hosted Elastic Agent
Go to the Logs tab
Switch the "Agent logging level" to "debug".

I'm hoping this gives us some more details on why the checkin fails.

@blaker I could use your help on this one. Any further ideas?
@hilt86 For the fleet-server URL, let's keep the one with the deployment id in the settings.

hilt86 · August 17, 2021, 12:44pm

Yep keeping the fleet-server url as the one with the deployment id.

I don't actually see a hosted Elastic agent....that can't be good :

How do I get it back so I can enable logging?

ruflin · August 19, 2021, 6:39am

This is definitely not good. Could you go to the Cloud console, remove the APM Fleet slider and add it again? This retriggers the setup. Did the Agent disappear after the 7.14 upgrade?
