I recently upgraded my Elastic stack to 8.12.0. Today, when I tried to upgrade my agents, they got stuck in an "unhealthy" state with a few errors stating Download failure and "No action taken" (screenshot attached).
What could be the cause of this issue? Is there a way to force the agents to update again so that they can download the required files?
Most likely this is a network issue that already existed and that the upgrade to 8.12.0 has exposed. Endpoint needs to redownload user artifacts after an upgrade, so if those downloads no longer work you'd see this error.
Do you see anything in Endpoint's logs that indicates what is happening? You might need to enable Debug logging (go to Fleet -> Agents -> select the host -> Logs -> change the log level at the bottom of the page -> click apply changes). If you do change the level to Debug, I recommend setting it back to Info after investigating, since Debug logs are much more verbose and will use more storage if you're storing them in Elasticsearch.
By default Endpoint's logs are ingested into Elasticsearch in the index logs-elastic_agent.endpoint_security-default, and you can explore them in the Agent details page (Fleet -> Agents -> select the host -> Logs) or in Observability -> Logs. You can also search Endpoint's logs on the affected host. They're stored in /opt/Elastic/Endpoint/state/log (Linux), /Library/Elastic/Endpoint/state/log (macOS), c:\Program Files\Elastic\Endpoint\state\log (Windows).
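If you'd rather script the on-host lookup, here is a minimal Python sketch that picks the log directory for the current OS and lists the files in it, newest first. The paths are the ones listed above; the function names (endpoint_log_dir, list_log_files) are just made up for this example, and reading that directory may require root or administrator rights.

```python
import platform
from pathlib import Path

# Log directories as listed in the post above, keyed by platform.system() values.
ENDPOINT_LOG_DIRS = {
    "Linux": Path("/opt/Elastic/Endpoint/state/log"),
    "Darwin": Path("/Library/Elastic/Endpoint/state/log"),
    "Windows": Path(r"c:\Program Files\Elastic\Endpoint\state\log"),
}


def endpoint_log_dir(system=None):
    """Return the Endpoint log directory for the given OS name.

    `system` defaults to platform.system() ("Linux", "Darwin", "Windows").
    """
    system = system or platform.system()
    try:
        return ENDPOINT_LOG_DIRS[system]
    except KeyError:
        raise ValueError(f"no known Endpoint log directory for {system!r}")


def list_log_files(log_dir):
    """Return the files in log_dir, newest first; empty list if it is missing."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    files = [p for p in log_dir.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in list_log_files(endpoint_log_dir()):
        print(path)
```

The most recently modified file is usually the one to open first after reapplying the policy.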
After setting the log level to Debug (if you can), reapply Endpoint's policy (go to the Endpoint policy and hit save without making any changes). After the policy applies look in Endpoint's new logs. You should see a log with the content Downloading artifact ... and then a failure. Look just before and after that and you should see some logging that indicates why the download fails. Endpoint does back off trying to download artifacts when there are network failures to limit unnecessary network traffic, so if you see a log about it failing because of a backoff look further up in the logs for the original network failure.
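The "look just before and after" step can also be scripted. Below is a rough Python sketch that finds every line containing Downloading artifact and returns it with a few lines of surrounding context. It assumes the log file is plain text with one entry per line and makes no other assumption about the log format; find_with_context and its parameters are names invented for this example.

```python
import sys


def find_with_context(lines, needle="Downloading artifact", before=5, after=5):
    """Return a list of (line_number, context_lines) for lines containing needle.

    line_number is 1-based; context_lines includes up to `before` lines
    ahead of the match, the match itself, and up to `after` lines following it.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            end = min(len(lines), i + after + 1)
            hits.append((i + 1, lines[start:end]))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_downloads.py <path to an Endpoint log file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line_number, context in find_with_context(handle):
            print(f"--- match at line {line_number} ---")
            print("\n".join(context))
```

If the context only shows a backoff message, increase `before` to reach further up the file toward the original network failure.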
If after looking at the logs you still aren't sure what's going on, you can DM the endpoint log to me and I can take a look for you. If you need to do that I will share a secure upload link with you.
Thanks for the clear explanation. This is really helpful.
I finally solved it by restarting the Fleet Server -> assigning the agent to a different policy -> reassigning the agent to the correct policy -> editing the policy and saving it. I'll follow your steps next time something like this happens.
On a different note, is it a good idea to run an upgrade on machines that are offline?
I don't have any real guidance to give on whether or not upgrading offline machines is a good idea. If they're offline then the upgrade will happen when they come online, meaning you might not be watching your stack if there are any problems like the one you encountered. On the other hand, it isn't always feasible to expect all Agents to be online at once (e.g. laptops that are opened and closed).