Dec 20th, 2023: [EN] Mastering Elastic Agent Scalability: Navigating the Holiday Surge with SQS, S3, and CloudWatch

Tags: #logs, #beats, #filebeat, #scale, #cloudwatch, #s3, #sqs

Ever found yourself wondering how to scale your Agent or your standalone Beats implementation? Are you currently grappling with the challenge of ingesting data from numerous CloudWatch log groups, only to find your system overwhelmed by a legion of Grinch-like Beats processes gobbling up your machine's RAM or pushing your CPUs to their limits? Perhaps you are thinking: how about I add one more, or fifteen more, Agents (say your Christmas campaign made your business go viral)? Before you do that, let's first get an overview of how Beats does its magic, and then walk through an example of how to scale while making the most of your setup.

Throughout this article we will use an example cluster running Elastic Agent, but the essence of the content applies equally to standalone Beats. Elastic Agent simplifies the deployment and administration of the cluster when multiple Beats are involved, which also makes scaling simpler: the agents are managed centrally, and policies and configurations stay consistent across them. It also comes with the added benefit of lower overall CPU and memory consumption, freeing up resources for your other applications.

We will dive into two types of Filebeat inputs, S3-SQS and CloudWatch, and in this context we are going to discuss horizontal scaling, also known as scale out (scale in for the reverse), and vertical scaling, also known as scale up (scale down for the reverse).

Vertical scaling refers to increasing the resources of an instance (more RAM, CPU, etc.), while horizontal scaling refers to adding more agents/Beats, i.e. more instances. With that, let's dive into the inputs.

S3-SQS input

The S3-SQS input (also referred to as the S3 input) can operate in several modes, among them:

  1. S3->Elastic

The S3-SQS input can be used to ingest directly from an Amazon S3 bucket. After the initial listing of the configured S3 bucket, Filebeat polls the S3 API for changes to that bucket at the interval set by the bucket_list_interval config parameter (300 seconds by default).
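As a rough illustration, here is a minimal standalone Filebeat sketch of this mode; the bucket ARN and the values shown are placeholders, not recommendations:

```yaml
filebeat.inputs:
  - type: aws-s3
    # Poll the bucket listing directly (no SQS queue involved).
    bucket_arn: "arn:aws:s3:::my-example-bucket"  # hypothetical bucket
    bucket_list_interval: 300s                    # how often to re-list the bucket for changes
    number_of_workers: 5                          # concurrent workers processing S3 objects

output.elasticsearch:
  hosts: ["https://localhost:9200"]               # adjust to your deployment
```

With Elastic Agent, these options are typically set through the corresponding AWS integration settings in the agent policy rather than in filebeat.yml.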

This mode of operation is stateful in Filebeat, which means that the state of each file is kept, including the last offset a harvester was reading from, and saved to disk regularly in what is called the Registry File. This file is crucial for preserving the reading position and ensuring all log lines are sent, particularly when the output destination (e.g., Elasticsearch or Logstash) is temporarily unreachable. In addition, the state information is held in memory while Filebeat is operational.

When Filebeat is restarted, it reconstructs the state using data from the registry file, allowing it to resume each harvester at the last known position. Filebeat assigns a unique identifier to the files so that a file is not identified by its name and path, making it impervious to rename and move operations. In this way, it detects whether a file has been processed before.
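If you run standalone Filebeat and want to inspect or tune where that state lives, the registry location and flush interval are configurable; a minimal sketch, with illustrative values:

```yaml
# filebeat.yml (standalone Filebeat; Elastic Agent manages this for you)
filebeat.registry.path: ${path.data}/registry  # where the state is persisted on disk
filebeat.registry.flush: 5s                    # how often in-memory state is flushed to the registry file
```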

Because of the polling nature of the input, it is not possible to scale horizontally in this mode, only vertically. One config parameter you might find handy to try before moving to another server, or changing the cloud instance type to a better provisioned one, is number_of_workers. Its default value for this input is 5. This parameter sets the number of lightweight threads processing the data (for those interested in the details, these are goroutines, which share a single address space and run concurrently). It can improve performance significantly, so it is something you may want to tweak for your specific use case and configuration.

Tips on tweaking number_of_workers: start with the number of workers equal to the number of CPUs. Given that Filebeat performs a number of blocking operations (e.g. writing state to disk, sending data to Elasticsearch over the network), you can then increase the number of workers incrementally to find the optimal spot.

A valid configuration in this mode with multiple Agents/Beats is to have each one configured with a different bucket. There is no point in pointing several of them at the same bucket, for the reasons mentioned above.

Note: For use cases involving a high volume of files in a bucket, the registry might become excessively large. You can either look into the Filebeat configuration options for managing the size of the registry file, or switch to operation mode #2, described next, to scale horizontally in a truly parallel manner.

  2. S3->SQS->Elastic / S3->SNS->SQS->Elastic

To set up this configuration, you need to create an SQS queue and configure the S3 bucket to publish events to it by creating an event notification.

You can scale horizontally (scale out) by configuring multiple Filebeats to ingest from the same SQS queue. This is especially useful when you need to read a high volume of files. The mechanism behind ingesting from SQS is based on long polling. Instead of polling the server with a set frequency, the request by the client remains open on the server side with a relatively long timeout, and as soon as there are updates, the server sends the information to the client in a more real-time manner (i.e. lower latency).

SQS ensures at-least-once delivery by keeping track of a visibility_timeout: the receiver (i.e. Filebeat) has to process the message within the visibility window and signal back to the server that the message has been processed and can be safely deleted. If it does not do so within that window, the message becomes visible again and is redelivered.
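A hedged sketch of the input in this mode (the queue URL and values are placeholders); the same configuration can be deployed on several Agents/Filebeats so that they share the queue:

```yaml
filebeat.inputs:
  - type: aws-s3
    # Consume S3 object-created notifications from an SQS queue (long polling).
    queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/my-example-queue"  # hypothetical queue
    visibility_timeout: 300s  # time Filebeat has to process a message before SQS redelivers it
    number_of_workers: 5      # concurrent SQS message processors per Filebeat instance
```

Every instance pointed at the same queue competes for messages, so adding an instance adds throughput; duplicates can still occur on redelivery, which is why the visibility_timeout should comfortably cover your processing time.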

Scaling out in this way can potentially massively increase the throughput.

Note (S3 -> SNS -> SQS -> Elastic): You can also add the SNS service to the mix for push-based message delivery. First, you set up Amazon S3 notifications to SNS. Then you create an SNS topic (i.e. a communication channel) and subscribe the SQS queues to it. When a message is sent to the topic, SNS pushes the message to all subscribed SQS queues. Keep in mind that SNS delivers a copy of each message to every subscriber, so consider whether you need the logs fanned out to one or to several queues.

CloudWatch input

The CloudWatch input can operate in several modes, among them:

  1. CloudWatch -> Elastic

Similarly to the first mode of operation of the S3-SQS input, i.e. ingesting directly from an S3 bucket, this mode works on the same principle, polling the AWS API at the interval set by scan_frequency (1 minute by default for CloudWatch).
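A minimal sketch of this mode; the log group ARN and values are placeholders:

```yaml
filebeat.inputs:
  - type: aws-cloudwatch
    # Poll a CloudWatch log group directly via the AWS API.
    log_group_arn: "arn:aws:logs:us-east-1:123456789012:log-group:my-example-group:*"  # hypothetical log group
    scan_frequency: 1m          # how often to poll for new log events
    start_position: beginning   # or "end" to pick up only new events
    number_of_workers: 5        # concurrent workers fetching from the log streams
```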

This mode of operation is also stateful and works in an almost identical manner.

Similarly, because of the polling nature of the input, it is not possible to scale horizontally in this mode, only vertically. One config parameter you might find handy to try before moving to another server or changing the cloud instance type is number_of_workers. Its default value for this input is 5. This parameter sets the number of lightweight threads processing the data (again, goroutines, which share a single address space and run concurrently). It can improve performance significantly, so it is something you may want to tweak for your specific use case and configuration.

Tips on tweaking number_of_workers: start with the number of workers equal to the number of CPUs. Given that Filebeat performs a number of blocking operations (e.g. writing state to disk, sending data to Elasticsearch over the network), you can then increase the number of workers incrementally to find the optimal spot.

A valid configuration in this mode with multiple Agents/Beats is to have each one configured with a different log group. There is no point in pointing several of them at the same log group, for the reasons mentioned above.

As a rule of thumb:

  • If there are multiple log groups: either use one agent and increase number_of_workers to scale within that agent, or use one agent per log group / log group prefix.
  • If there is only one log group and you notice throttling: give each agent a log stream / log stream prefix instead of the whole group (see the sketch after this list).
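For instance, a hedged sketch of splitting one log group between two agents by stream prefix (the group, region, and prefixes are hypothetical):

```yaml
# Agent/Filebeat 1
filebeat.inputs:
  - type: aws-cloudwatch
    log_group_name: my-example-group  # hypothetical log group
    region_name: us-east-1
    log_stream_prefix: service-a-     # only streams whose names start with this prefix

# Agent/Filebeat 2
filebeat.inputs:
  - type: aws-cloudwatch
    log_group_name: my-example-group
    region_name: us-east-1
    log_stream_prefix: service-b-
```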

Side note: The latency parameter accounts for the delay with which some AWS services publish log events; if events arrive later than the input looks for them, they can be missed, resulting in missing documents.

Note: For use cases involving a high volume of logs, the registry might become excessively large. You can either look into the Filebeat configuration options for managing the size of the registry file, or switch to operation mode #2, described next, to scale horizontally in a truly parallel manner.

  2. CloudWatch->S3->SQS->Elastic / CloudWatch->Lambda->SNS->SQS->Elastic

Similarly to the second mode of operation of the S3-SQS input, this mode works on the same principle, so in this section, we will do a short recap and focus more on the options you have to set up the pipeline on the AWS side.

CloudWatch has an option to export logs to S3; however, the export has to be triggered manually each time. If this fits your use case, you can export the logs to S3 and follow the instructions for S3->SQS->Elastic, i.e. mode of operation #2 of the S3-SQS input.

To automate this process, you can use a Subscription Filter with AWS Lambda instead. The Lambda function can then send the logs to SNS. You can then configure one or more queues to subscribe to the SNS topics. For more information on this pattern and an example of how to implement it, you can have a look at this blog.

Finally, you can scale horizontally (scale out) by configuring multiple Filebeats to ingest from the same SQS queue. This is especially useful when you need to read a high volume of files. The mechanism behind ingesting from SQS is based on long polling, explained in more detail above for the second mode of operation of the S3-SQS input.

As before, SQS ensures at-least-once delivery by keeping track of a visibility_timeout: the receiver (i.e. Filebeat) has to process the message within the visibility window and signal back to the server that the message has been processed and can be safely deleted. If it does not do so within that window, the message becomes visible again and is redelivered.

Conclusion: a note on the final performance

Scaling out to multiple agents will not necessarily increase the throughput by the same factor. This is because the end-to-end solution consists of several components as shown in the diagram below and we have only been looking at the optimization we can do on the shipper (Agent/Beats) to speed up the processing time.

Keep in mind that there are delays on the AWS side as well, in making the data available between the services themselves. In this article we also did not look at Elasticsearch; the end-to-end results will depend on all the components in the chain. Now that we have covered Beats, to get started on configuring Elasticsearch for optimal performance for your use case you can check this blog post.

Finally, if you are targeting live processing of logs, then other solutions such as the Elastic Serverless Forwarder (ESF) and Amazon Kinesis Data Firehose are likely a better fit.

This post is also available in Spanish.
