Add *built-in* rate limiting/throttling

There are multiple threads on this, but they're locked now. I think the existing solutions are all inadequate for a server environment with shared hardware/network infrastructures, where "playing nice" is important.

Describe the enhancement:

Add rate limiting to the configuration options available to the various Beats.

The current docs suggest using traffic control (tc) to add rate limiting. This is suboptimal for a few reasons:

  1. tc is incredibly difficult to use. I learned it well enough to write the commands needed for the throttling I wanted. Looking back only a few months later, I have absolutely no clue what they mean. It's an inevitable maintenance problem.

  2. It mutates global state and breaks encapsulation of this application. What if another application also runs a script to configure tc for its own traffic shaping? It's a collision waiting to happen.

  3. The configuration for the rate limiting is encoded into whatever script sets up tc. In effect, this is now a "second" config file, meaning that you no longer have a centralized place for all your Filebeat configuration needs.

  4. It's not cross-platform.

  5. Go offers a bunch of great rate limiting options. There's a better way.

Describe a specific use case for the enhancement or feature:

  • A common issue is that Filebeat "builds a backlog" (by remembering "where it left off") when Elasticsearch becomes temporarily unavailable. When it comes back up, Filebeat hammers it. If you have lots of hosts, they're all competing, greedily trying to run as fast as possible, causing all kinds of load and brown out issues.
    • More specifically, my personal issue is that my Filebeat agents and Elasticsearch clusters have negotiated a speed that works for them, but is too fast for our network infrastructure, which I don't have the power to change.
  • Filebeat/Elasticsearch also has this strange tendency to sometimes spike traffic higher than normal during times when the cluster is overloaded. Seems counter-intuitive, but I can't find the steps to replicate.

Giving users more knobs to turn things down is still top of mind for us, and I am personally interested in this scenario, so I am trying to gather information.

A common issue is that Filebeat "builds a backlog" (by remembering "where it left off") when Elasticsearch becomes temporarily unavailable. When it comes back up, Filebeat hammers it. If you have lots of hosts, they're all competing, greedily trying to run as fast as possible, causing all kinds of load and brown out issues.

This is a problem with all network-related clients. When an error occurs and you run a large swarm of Beats, each Filebeat client will try to reconnect. The rate of those reconnection retries can cause real traffic spikes; if you look at the connection graph, you will see distinct bumps at regular intervals.

In 6.6.1 we added some jitter to the backoff mechanism used by the outputs. This should make a real difference and help distribute the reconnects over time.


Hey, thanks for reaching back!

That's great to hear, but the connection spike wasn't the only issue I was experiencing. My primary issue was that the speed Filebeat and Elasticsearch negotiated between each other was too fast for the (multi-occupant) load balancer that was in between them. We hogged all the traffic and browned out our neighbors.

How a multi-tenant load balancer got shipped without throttling functionality... is beyond me. But still, I'm looking for a dependable, simple, and portable application-layer fix to the problem.

@amomchilov OK, I understand: you want a way to cap the bandwidth (or EPS) for either the whole stream of data or a subset of the stream. We do not have those knobs at the moment.

The closest issues we have are https://github.com/elastic/beats/issues/662 and https://github.com/elastic/beats/pull/6035, but I will try to create a new one with a proposal.

@pierhugues Indeed! An EPS cap would probably be easier to implement, but really, a bandwidth cap would be best.

I suspect that anyone who wants rate limiting would be thinking about it in terms of data rate. EPS can be an approximation of bandwidth, but capping bandwidth directly would be even better.
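
To show why EPS is only an approximation: the bandwidth implied by an EPS cap is roughly the cap multiplied by the average event size, so the same cap can mean very different data rates. A small sketch follows; the `bytesPerSecond` helper, the 1000 EPS cap, and the event sizes are hypothetical numbers chosen for illustration, not measurements.

```go
package main

import "fmt"

// bytesPerSecond estimates the bandwidth implied by an EPS cap,
// assuming every event has the given average size.
func bytesPerSecond(eps, avgEventBytes float64) float64 {
	return eps * avgEventBytes
}

func main() {
	const epsCap = 1000.0 // hypothetical EPS cap
	// The same cap yields data rates that differ by two orders of magnitude.
	for _, size := range []float64{200, 2000, 20000} {
		fmt.Printf("%6.0f-byte events -> %.1f MB/s\n", size, bytesPerSecond(epsCap, size)/1e6)
	}
}
```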

The closest issue is https://github.com/elastic/beats/issues/3847, which covers rate limiting and sampling of data (depending on the source Beat).

Great, thanks!
