I'm on 7.13.4 using fleet server and my agents have disappeared overnight. I've tried restarting the entire cluster (in elastic cloud) and the agents all still show as offline in fleet.
The cluster health is all good too. When I try to enroll a new agent I get:
Error: fail to enroll: fail to execute request to fleet-server: status code: 0, fleet-server returned an error: , message: Unknown resource.
Any ideas on how I can get my agents back online please?
Sorry, I should have been more specific. In the Fleet UI there are at least 2 policies: one for the hosted Elastic Agent and a default one. The hosted Elastic Agent run by Elastic Cloud is enrolled into the "Elastic Cloud agent policy" policy. If you go to the list of Agents in the Fleet UI and filter by this policy, do you see any Elastic Agent? If the answer is no, could you share the content of this policy in YAML (with the confidential parts removed)? We have recently seen a migration issue that puts an empty array input: [] into this policy, which breaks it. Hope this moves us a step closer to the solution.
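For anyone who wants to check an exported policy for the empty-inputs problem described above, here is a minimal sketch in Python. It assumes the policy has been saved as YAML text and that the broken key appears at the top level as `inputs: []` (or `input: []`, as written above); the function name and the sample policy snippets are made up for illustration and are not taken from a real deployment.

```python
import re

# Matches a top-level "inputs: []" (or "input: []") line, optionally followed
# by a YAML comment. The key name is an assumption based on the description
# of the migration issue; adjust it if your exported policy differs.
EMPTY_INPUTS = re.compile(r"^inputs?:[ \t]*\[[ \t]*\][ \t]*(#.*)?$", re.MULTILINE)


def has_empty_inputs(policy_yaml: str) -> bool:
    """Return True if the policy text contains an empty top-level inputs array."""
    return EMPTY_INPUTS.search(policy_yaml) is not None


# Hypothetical policy snippets, for illustration only.
broken_policy = "id: example-policy\ninputs: []\n"
healthy_policy = "id: example-policy\ninputs:\n  - id: example-input\n    type: fleet-server\n"

print(has_empty_inputs(broken_policy))
print(has_empty_inputs(healthy_policy))
```

This is only a text-level check, not a full YAML parse, so it will not catch an empty list written in another form (for example an `inputs:` key with nothing under it).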
The policy looks fine. I was worried that the input part was missing, but it is there. Any chance you could send me a PM with the cluster id so I could have a look at the logs?
Thanks for the cluster id. I finally had a look at it. Based on the logs, I assume you have upgraded to 7.14 by now? I see some logs about the Elasticsearch cluster being overloaded, but I guess that happened during the migration? It would also not explain why the Elastic Agents were dropped. Did the 7.14 upgrade solve the issue?
How many Elastic Agents do you have enrolled? There was a memory leak in 7.13 that is fixed now but maybe that caused an issue if you had many Elastic Agents?
There is one more error I see in the logs that worries me a bit around invalid API keys. Did you by chance play around with the API Keys in Elasticsearch directly?
Thanks @ruflin - yeah I upgraded to 7.14 to see if that would fix the problem but alas it hasn't.
I've written a PowerShell script to install Elastic Agent. Because I have been re-creating clusters so often, I used a custom / predictable name and replaced the fleet endpoint url in the enroll url so that I have a predictable endpoint URL (not sure it is that helpful in the end, as enrollment tokens need to match).
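To make the URL/token pairing concrete, here is a rough sketch (in Python rather than PowerShell) of building the install command from a Fleet Server URL and an enrollment token. It assumes the `--url` and `--enrollment-token` flags of `elastic-agent install` as used with Fleet Server in 7.13+, plus `--force` to skip the confirmation prompt; the function name, the placeholder URL, and the placeholder token are hypothetical. It only builds the argument list and does not run anything.

```python
from typing import List
from urllib.parse import urlparse


def build_install_command(fleet_url: str, enrollment_token: str) -> List[str]:
    """Build the argument list for an Elastic Agent install (not executed here)."""
    parsed = urlparse(fleet_url)
    # Fail early on an obviously malformed endpoint instead of at enroll time.
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"Fleet Server URL must be an https URL: {fleet_url!r}")
    if not enrollment_token:
        raise ValueError("enrollment token must not be empty")
    return [
        "elastic-agent",
        "install",
        "--force",  # assumed flag: skip the interactive confirmation
        f"--url={fleet_url}",  # assumed flag name (Fleet Server URL)
        f"--enrollment-token={enrollment_token}",  # assumed flag name
    ]


# Placeholder values only; the token must come from the same deployment as the URL.
print(build_install_command("https://fleet.example.invalid:443", "PLACEHOLDER_TOKEN"))
```

A predictable URL alone is not enough: the token still has to be one issued by the deployment that the URL points at, so both values need to be refreshed together whenever the cluster is re-created.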
I only had 30 something agents so there should be no capacity issues (IMHO) but the agents are really flaky at the moment across the board. Since 7.14 even the agents on linux servers have had issues / become unhealthy for no apparent reason...
BTW I see many "failed checkin" logs from the fleet-server, which are likely explained by it checking in to a wrong local url. Now I want to know even more about your fleet endpoint config.
Sorry to pick on this again. Can you share the exact fleet-url that is there (with some obfuscation)? Did you use the alias as the fleet-server url or the one with the deployment id?
Also is there supposed to be an 'Elastic Agent' service? On one of the endpoints this service is nowhere to be found, only the 'Elastic Endpoint' service exists..
I'm still trying to figure this one out. Every few minutes I keep seeing failed checkins in the fleet-server logs. Unfortunately the more detailed logs are on the debug level. Let's try the following:
Go to the Agent list page
Select the hosted Elastic Agent
Go to the Logs tab
Switch the "Agent logging level" to "debug".
I'm hoping this gives us some more details on why the checkin fails.
@blaker I could use your help on this one. Any further ideas? @hilt86 For the fleet-server url, let's keep the one with the deployment id in the settings.
This is definitely not good. Could you go to the Cloud console, remove the APM & Fleet slider and add it again? This retriggers the setup. Did the Agent disappear after the 7.14 upgrade?