Adding high availability options to libbeat

Hi folks,

About a year and a half ago we added a number of high-availability options to a local fork of logstash-forwarder.
The two main things we needed at the time were:

  1. Having a concept of "destinations". Each input would be tagged with a specific destination to send to. Multiple destinations (i.e., multiple ELK clusters) could be configured as outputs, and each would take the log lines given to it and forward them to the right place.
  2. Connecting to all listed output servers simultaneously, and sending lines to all of them at the same time. This way, if one server goes down, we don't block the entire pipeline trying to send logs to it. It also lets us get much higher throughput: all logstash servers use roughly equal amounts of CPU, and if one server gets too slow, it ends up getting less traffic until it recovers.
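
Roughly, in Go-flavored pseudocode, the fan-out from point 2 works like this (all names and addresses here are invented for illustration, not our actual fork's code):

```go
// Sketch of point 2: every sender drains the same channel, so lines go
// out concurrently and a slow or dead server simply pulls less work.
package main

import (
	"fmt"
	"sync"
	"time"
)

func sender(addr string, lines <-chan string, wg *sync.WaitGroup) {
	defer wg.Done()
	for line := range lines {
		// stand-in for a Lumberjack send + ACK; real code would
		// re-queue the line on failure instead of dropping it
		time.Sleep(10 * time.Millisecond)
		fmt.Printf("%s <- %q\n", addr, line)
	}
}

func main() {
	lines := make(chan string, 1024)
	servers := []string{"ls-1:5044", "ls-2:5044", "ls-3:5044"}

	var wg sync.WaitGroup
	for _, addr := range servers {
		wg.Add(1)
		go sender(addr, lines, &wg) // connect to all servers at once
	}

	for i := 0; i < 9; i++ {
		lines <- fmt.Sprintf("log line %d", i)
	}
	close(lines)
	wg.Wait()
}
```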

We're going to migrate to Filebeat soon, and would very much like to contribute these changes back upstream.
If anyone has thoughts on how these changes should look, specific things we'd need to address in order to get them accepted, etc., we'd love to hear them.

Thanks!

  1. How was tagging done? A different tag per prospector? Did you have multiple logstash endpoints per tag, used for failover/replication?

  2. You're basically asking for replication, or a mix of loadbalancing + replication? E.g. forward to 3 logstash instances out of 12? Replication at random, or based on grouping multiple logstash endpoints and forwarding each message to one LS instance per group? Did you wait for all logstash instances to ACK, or only a subset? When indexing into ES from multiple LS instances, how did you guarantee deduplication (i.e., how did you create the document _id)?
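
For reference, one common answer to that last question is deriving the document _id deterministically from the event itself, so a re-sent event overwrites its earlier copy instead of duplicating it. A minimal sketch, with entirely made-up field choices:

```go
// Hypothetical dedup helper: hash the fields that identify a log line
// and use the result as the Elasticsearch document _id, so indexing the
// same event twice hits the same document instead of creating a copy.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func docID(host, path string, offset int64, line string) string {
	h := sha256.New()
	fmt.Fprintf(h, "%s|%s|%d|%s", host, path, offset, line)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(docID("web-1", "/var/log/app.log", 4242, "GET /index 200"))
}
```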

Filebeat already supports loadbalancing (connect to all hosts) and failover mode (connect to a single host). In loadbalancing mode, batches are forwarded to one logstash instance at a time (potentially in parallel; there is still room for improvement), and if a logstash instance fails, the batch will be forwarded to another logstash instance. In loadbalancing, if one instance is slower, it will get less work. There is definitely (loads of) room for improvement in beats, but it's worth discussing what these improvements should look like and how configurable they should be. More complicated 'patterns' falling into the realm of event routing, for example, should be handled by logstash (which tries to be more generic), not beats (which tries to be much narrower in scope).
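
Roughly, the loadbalancing behaviour described above looks like this (all names invented for illustration; this is not libbeat's actual code):

```go
// Sketch of loadbalancing mode: a batch goes to one instance at a time,
// and on failure is forwarded to another until someone ACKs it. A slow
// instance naturally ends up pulling fewer batches.
package main

import (
	"errors"
	"fmt"
)

type Event struct{ Line string }

// Conn stands in for a connection to one logstash host.
type Conn struct {
	Addr string
	Up   bool
}

func (c Conn) Send(batch []Event) error {
	if !c.Up {
		return errors.New("connection down")
	}
	return nil // pretend the batch was ACKed
}

func publishBatch(batch []Event, conns []Conn) error {
	for _, c := range conns {
		if err := c.Send(batch); err != nil {
			continue // this instance failed; forward batch to the next
		}
		return nil // batch ACKed by one instance
	}
	return errors.New("all logstash instances failed")
}

func main() {
	conns := []Conn{{Addr: "ls-1:5044", Up: false}, {Addr: "ls-2:5044", Up: true}}
	fmt.Println(publishBatch([]Event{{Line: "hello"}}, conns)) // <nil>
}
```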

Hi Steffen! Sorry for the slow reply :slight_smile:

  1. The config listed multiple "destinations". Each of these is just a block of Logstash servers. When Lumberjack started, a channel was created for each destination. When each prospector started, it would use this destination tag to point new harvesters at the right output channels. We have different ELK clusters for different purposes (compliance, dev/debug data, production data, security, etc.) and it was a convenient way to run a single instance of Lumberjack which forwarded to each of them based on the filename. (There's a rough sketch of this after point 2.)

  2. type type type... delete delete delete.. :slight_smile: I looked up the documentation on mode.balance and LoadBalancerMode - this is exactly what we're doing right now, I think. Connecting to all listed Logstash servers, and using channels to balance sending to all of them concurrently. So that's awesome!
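
For point 1, the routing was roughly this shape (paths, names, and destinations here are all illustrative, not our actual code):

```go
// Sketch of destination tagging: a channel per destination is created at
// startup, and each prospector's tag decides which channel its
// harvesters write to.
package main

import "fmt"

type Prospector struct {
	Path        string
	Destination string // set in the config, per input
}

func main() {
	// one output channel per configured destination (per ELK cluster)
	outputs := map[string]chan string{
		"production": make(chan string, 1024),
		"security":   make(chan string, 1024),
	}

	prospectors := []Prospector{
		{Path: "/var/log/app/*.log", Destination: "production"},
		{Path: "/var/log/auth.log", Destination: "security"},
	}

	for _, p := range prospectors {
		out, ok := outputs[p.Destination]
		if !ok {
			panic("unknown destination: " + p.Destination)
		}
		// a real harvester would tail p.Path and write every line here
		out <- "line from " + p.Path
	}

	fmt.Println(<-outputs["production"], "|", <-outputs["security"])
}
```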

If there are improvements, I would be more than happy to discuss and work on them :slight_smile:

  1. This sounds like minimal event routing, mostly similar to topic-based pub/sub. It's kind of out of scope at the moment, and the team will have to discuss if and how event routing shall be supported by libbeat.
    Personally, I've been thinking about adding output groups and making use of generic filtering to select events based on configured predicates. Granted, it's more CPU intensive, as we don't hook up channels but rely on per-event filtering, but it's also more flexible. For example, use a different (or additional) output for log messages having a CRITICAL log level. Reserving an extra output for CRITICAL + alerting ensures queues are not shared (queues for CRITICAL will hopefully be empty most of the time), giving CRITICAL logs a chance to be processed faster for alerting in logstash. (There's a sketch of this after point 2 below.)

  2. Great.
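
To make the output-groups idea from point 1 concrete, a minimal sketch with hypothetical names (OutputGroup, route): each group pairs a configured predicate with its own queue, so CRITICAL traffic never shares a queue with the rest.

```go
// Sketch of predicate-based output groups: every event is checked
// against each group's predicate, and CRITICAL events get a dedicated
// (hopefully near-empty) queue. All names are hypothetical.
package main

import "fmt"

type Event map[string]interface{}

type OutputGroup struct {
	Name  string
	Match func(Event) bool // configured predicate, evaluated per event
	Queue chan Event       // each group has its own queue/output
}

func route(ev Event, groups []OutputGroup) {
	for _, g := range groups {
		if g.Match(ev) {
			g.Queue <- ev // one event may land in several groups
		}
	}
}

func main() {
	groups := []OutputGroup{
		{
			Name:  "critical-alerting",
			Match: func(ev Event) bool { return ev["level"] == "CRITICAL" },
			Queue: make(chan Event, 16),
		},
		{
			Name:  "default",
			Match: func(ev Event) bool { return true },
			Queue: make(chan Event, 1024),
		},
	}

	route(Event{"level": "CRITICAL", "message": "disk full"}, groups)
	route(Event{"level": "INFO", "message": "hello"}, groups)

	fmt.Println(len(groups[0].Queue), len(groups[1].Queue)) // 1 2
}
```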

This is definitely an idea I can see being helpful to many people.
It's a great way to let people be flexible with the data they have!

Thank you!