Best practices for spot/preemptible instances

I'm a noob with this stuff so looking for some guidance.

Say I have a GCP cluster that uses preemptible instances, or an AWS cluster that uses spot instances. If I have an Agent policy with the "Google Cloud Platform" or "AWS" integrations pointed at those clusters and gathering metrics/logs, then I end up with an ever-growing list of agents as the "spot" instances come and go.

How do people manage this? I've only been at this for a few days and already 85% of my agents are listed as "offline". Is there some way to purge old agents or is that not desirable? Do I just have to get used to filtering by "healthy" in the UI?

thanks,
Marco

P.S. What's the difference between "offline" and "inactive" in the pull-down?


If I understand your question correctly, you want to bulk purge offline agents. Have you tried to get a list of agents and then filter by offline?

curl -X GET "http://localhost:5601/api/fleet/agents" -H "Authorization: Bearer <your_token>"

and then you could possibly do a bulk unenroll

curl -X POST "http://localhost:5601/api/fleet/agents/bulk_unenroll" -H "Authorization: Bearer <your_token>" -H "kbn-xsrf: true" -H "Content-Type: application/json" --data-raw '{"agents": [<offline_agent_ids>]}'
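In case it helps, here is a rough sketch of how the bulk unenroll call could be wrapped in a small script once you have the offline agent IDs. It assumes the Fleet API is reached through Kibana. KIBANA_URL, FLEET_AUTH, and both function names are made up for illustration, so adjust them to your setup.

```shell
#!/usr/bin/env bash
# Sketch only. KIBANA_URL and FLEET_AUTH are hypothetical placeholders,
# e.g. FLEET_AUTH="ApiKey <base64_key>" or "Bearer <your_token>".
KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"

# Build the JSON body for bulk_unenroll from agent IDs passed as arguments.
build_unenroll_payload() {
  local ids="" id
  for id in "$@"; do
    # Append each ID as a quoted JSON string, comma-separated.
    ids="${ids:+$ids,}\"$id\""
  done
  printf '{"agents": [%s]}' "$ids"
}

# Send the request. Kibana expects a kbn-xsrf header on non-GET calls.
unenroll_agents() {
  curl -sS -X POST "$KIBANA_URL/api/fleet/agents/bulk_unenroll" \
    -H "Authorization: $FLEET_AUTH" \
    -H "kbn-xsrf: true" \
    -H "Content-Type: application/json" \
    --data-raw "$(build_unenroll_payload "$@")"
}
```

You would call it as `unenroll_agents "<agent_id_1>" "<agent_id_2>"`. Running `build_unenroll_payload` on its own first lets you check the request body before anything is sent.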

Thanks Sunile. I hadn't considered that. I don't even know if I really want/should do this anyway.

That said, I stumbled upon the "Unenrollment timeout" setting the other day. It is available under Settings for each Agent policy. It looks like it automatically unenrolls any agent that has been gone for more than a configurable number of seconds. I think this will do what I want. (Whether what I want is a good idea is another question... :) )
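For anyone who would rather script this than click through the UI, the same setting can probably be applied through the Fleet agent policy API. This is a sketch under assumptions: I believe the field is called `unenroll_timeout` and that the update call needs the policy name and namespace in the body, but check the API docs for your version. KIBANA_URL, FLEET_AUTH, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. KIBANA_URL and FLEET_AUTH are hypothetical placeholders.
KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"

# Build the policy update body.
# $1 = policy name, $2 = namespace, $3 = timeout in seconds.
# "unenroll_timeout" is assumed to be the API field behind the UI setting.
build_policy_update() {
  printf '{"name": "%s", "namespace": "%s", "unenroll_timeout": %d}' "$1" "$2" "$3"
}

# $1 = agent policy ID, remaining arguments go to build_policy_update.
set_unenroll_timeout() {
  local policy_id="$1"
  shift
  curl -sS -X PUT "$KIBANA_URL/api/fleet/agent_policies/$policy_id" \
    -H "Authorization: $FLEET_AUTH" \
    -H "kbn-xsrf: true" \
    -H "Content-Type: application/json" \
    --data-raw "$(build_policy_update "$@")"
}
```

For example, `set_unenroll_timeout "<policy_id>" "spot-nodes" "default" 3600` would ask for agents on that policy to be unenrolled after an hour of silence. The policy name and the one-hour value are just examples.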

thanks,
Marco

Yes, that makes sense. Docs for anyone who runs into a similar issue: "Set unenrollment timeout for ephemeral hosts" in the Fleet and Elastic Agent Guide [7.17].
