Elastic Agent/Fleet Limit how many agents get a policy update at once

Hi All,

I was wondering if there is a way to limit the number of agents that can pull an updated policy at once?

The reason is that in large deployments, when I perform a policy update (e.g., adding a new integration or changing an integration setting), it can cause an inrush of events that hampers the cluster.

An example is deploying the System integration to a policy with ~95 (Linux) agents in it. The System integration settings are fairly default, collecting the default log and metric settings.

From these graphs:

You can see that when this change is deployed, the cluster goes from indexing ~20k eps to ~140k eps (around this cluster's max indexing rate). I'd much rather have Fleet limit the number of agents that can pull updates within a given timeframe than have the cluster hit and stay at its max indexing rate for a long period of time. The main concern here is that I don't want search times to be adversely affected by this large uptick in ingest.

Note: One workaround I thought about was dividing agents into smaller policies, but I think with large deployments, this would quickly become hard to manage.

Hi @BenB196

I think what you are trying to achieve could be done by configuring server.limits.policy_throttle in your Fleet Server integration policy (see the doc here: Fleet Server scalability | Fleet and Elastic Agent Guide [master] | Elastic).

Hi @nchaulet, thanks for the information. I looked into this, and unless I'm misinterpreting it, the setting seems to control how often a new update is rolled out to "all" agents. The doc describes it as:

"How often a new policy is rolled out to the agents."

In this case, I read the setting as follows: if I make a policy change, it will wait 200ms to see if I make another policy change, and once the 200ms is up, it will roll out the change.

What I'm really looking for is this: if I make a policy change, only 20 agents can be in the process of applying that change at a time, and all other agents are held back from applying it until fewer than 20 agents are applying it.
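To make the behavior I'm describing concrete, here is a minimal sketch of a concurrency-capped rollout. It is purely illustrative and not anything Fleet provides: `roll_out`, `max_in_flight`, and `apply_seconds` are hypothetical names, and the sleep is a stand-in for an agent applying the policy.

```python
import asyncio


async def roll_out(agent_ids, max_in_flight=20, apply_seconds=0.01):
    """Apply a policy change to every agent, at most max_in_flight at a time.

    Hypothetical sketch of the requested behavior, not a Fleet API.
    """
    gate = asyncio.Semaphore(max_in_flight)
    in_flight = 0  # agents currently applying the change
    peak = 0       # highest number of agents applying it at once
    done = []      # agents that have finished applying the change

    async def apply(agent_id):
        nonlocal in_flight, peak
        async with gate:  # waits here while max_in_flight agents are busy
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(apply_seconds)  # stand-in for applying the policy
            in_flight -= 1
            done.append(agent_id)

    await asyncio.gather(*(apply(a) for a in agent_ids))
    return peak, done


# 95 agents, as in the example policy above; no more than 20 apply at once.
peak, done = asyncio.run(roll_out(range(95)))
print(f"peak concurrent agents: {peak}, agents updated: {len(done)}")
```

The point of the cap is that the number of agents starting to ship new data at the same moment is bounded, regardless of how many agents are in the policy.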

There is currently no mechanism to do what you have described. There is a simple throttle in the Fleet Server that limits the rate at which all policies can roll out. However, it is not granular enough to control rollout rates per policy, nor does it sample the response rate as an input to the throttle. In addition, the throttle is per Fleet Server instance and does not take into account rollouts occurring in other running Fleet Servers, if any are executing.
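As a rough back-of-the-envelope illustration of why a per-instance throttle does not bound the overall rollout rate, here is a small sketch. It assumes a simplified model in which each Fleet Server dispatches one policy per throttle interval and agents are spread evenly across servers; `rollout_seconds` and its parameters are hypothetical names, and the 200ms value is just the figure mentioned earlier in this thread.

```python
def rollout_seconds(agents, throttle_ms, fleet_servers=1):
    """Approximate time to dispatch a policy to all agents.

    Simplified model (an assumption, not Fleet Server's actual algorithm):
    each server dispatches one policy per throttle interval, independently
    of the other servers, and agents are split evenly between servers.
    """
    per_server = -(-agents // fleet_servers)  # ceiling division
    return per_server * throttle_ms / 1000


one_server = rollout_seconds(95, 200)       # single Fleet Server
two_servers = rollout_seconds(95, 200, 2)   # same throttle, two Fleet Servers
print(one_server, two_servers)
```

Under this model, adding a second Fleet Server roughly halves the rollout time with the same throttle value, which is why the setting cannot be used to cap the rollout rate for the deployment as a whole.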

Fine-grained controls on a specific policy rollout would be a new feature in Fleet that does not yet exist.
