Very poor elastic-agent performance with SQS/S3 for Cloudtrail logs

I've read the July blog post claiming 8 x t3.micro instances can process > 40,000 events/second using SQS/S3. That comes out to > 5,000 events/second/host.

I currently have 1 x m6a.large and 1 x t3a.small running 8.15.1, and combined I can't get them over 33.23 events/second. They are definitely getting Cloudtrail records, but ingestion is painfully slow. Here are the 30-second metrics for the larger of the two instances:

{
	"log.level": "info",
	"@timestamp": "2024-09-24T14:41:34.120Z",
	"message": "Non-zero metrics in the last 30s",
	"component": {
		"binary": "filebeat",
		"dataset": "elastic_agent.filebeat",
		"id": "aws-s3-default",
		"type": "aws-s3"
	},
	"log": {
		"source": "aws-s3-default"
	},
	"log.logger": "monitoring",
	"log.origin": {
		"file.line": 192,
		"file.name": "log/log.go",
		"function": "github.com/elastic/beats/v7/libbeat/monitoring/report/log.(*reporter).logSnapshot"
	},
	"service.name": "filebeat",
	"monitoring": {
		"ecs.version": "1.6.0",
		"metrics": {
			"beat": {
				"cgroup": {
					"memory": {
						"mem": {
							"usage": {
								"bytes": 515555328
							}
						}
					}
				},
				"cpu": {
					"system": {
						"ticks": 350,
						"time": {
							"ms": 20
						}
					},
					"total": {
						"ticks": 1460,
						"time": {
							"ms": 110
						},
						"value": 1460
					},
					"user": {
						"ticks": 1110,
						"time": {
							"ms": 90
						}
					}
				},
				"handles": {
					"limit": {
						"hard": 65535,
						"soft": 65535
					},
					"open": 18
				},
				"info": {
					"ephemeral_id": "6f2ee31c-60ff-4bda-bbc7-40af4db66766",
					"uptime": {
						"ms": 270113
					},
					"version": "8.15.1"
				},
				"memstats": {
					"gc_next": 98311080,
					"memory_alloc": 79375672,
					"memory_sys": 262144,
					"memory_total": 224758128,
					"rss": 222179328
				},
				"runtime": {
					"goroutines": 73
				}
			},
			"filebeat": {
				"events": {
					"active": 57,
					"added": 275,
					"done": 280
				},
				"harvester": {
					"open_files": 0,
					"running": 0
				}
			},
			"libbeat": {
				"config": {
					"module": {
						"running": 1
					}
				},
				"output": {
					"events": {
						"acked": 280,
						"active": 0,
						"batches": 3,
						"total": 280
					},
					"read": {
						"bytes": 7734,
						"errors": 3
					},
					"write": {
						"bytes": 100902,
						"latency": {
							"histogram": {
								"count": 26,
								"max": 196,
								"mean": 60.23076923076923,
								"median": 46,
								"min": 12,
								"p75": 84,
								"p95": 190.04999999999999,
								"p99": 196,
								"p999": 196,
								"stddev": 46.27036573836574
							}
						}
					}
				},
				"pipeline": {
					"clients": 5,
					"events": {
						"active": 57,
						"published": 275,
						"total": 275
					},
					"queue": {
						"acked": 280,
						"added": {
							"bytes": 800390,
							"events": 275
						},
						"consumed": {
							"bytes": 794717,
							"events": 280
						},
						"filled": {
							"bytes": 183267,
							"events": 57,
							"pct": 0.0178125
						},
						"max_bytes": 0,
						"max_events": 3200,
						"removed": {
							"bytes": 794717,
							"events": 280
						}
					}
				}
			},
			"registrar": {
				"states": {
					"current": 0
				}
			},
			"system": {
				"load": {
					"1": 0.1,
					"15": 0.09,
					"5": 0.12,
					"norm": {
						"1": 0.05,
						"15": 0.045,
						"5": 0.06
					}
				}
			}
		}
	},
	"ecs.version": "1.6.0"
}

I have tried the balanced and throughput settings, but there was no discernible change. Is there some magic "work faster" switch I haven't found documented?

Thanks,
Scott

I did attempt to downgrade to 8.14.1 as mentioned in the July 2024 blog but there was no change.

I was able to get up to a total of ~100 events/second combined across both instances after cleaning out my elastic-agent.yml to the bare minimum and switching back to the throughput setting. Still a far cry from the 5,000/s claimed in the ES blog.
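
For reference, the relevant part of the trimmed-down config looks roughly like this (just a sketch; the host and API key are placeholders, and I'm assuming preset is the same knob the Fleet "Performance tuning" option maps to):

outputs:
  default:
    type: elasticsearch
    hosts: ["https://my-cluster.example.com:443"]  # placeholder
    api_key: "${ES_API_KEY}"                       # placeholder
    preset: throughput                             # balanced (default) | throughput | scale | latency | custom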

What are your configurations for the SQS input?

For example, what maximum number of concurrent SQS messages are you using?

I have a Fleet-managed setup with a Cloudtrail input where I'm using 50 maximum concurrent SQS messages, and I can get around 2,000 events/second on an 8 vCPU/16 GB VM.
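
I set this through the integration's advanced options in Fleet, but in standalone terms it amounts to roughly the following (a sketch only; the queue URL and data stream details are placeholders):

inputs:
  - id: aws-s3-cloudtrail
    type: aws-s3
    use_output: default
    streams:
      - data_stream:
          dataset: aws.cloudtrail
        queue_url: "https://sqs.us-east-1.amazonaws.com/111111111111/cloudtrail-queue"  # placeholder
        max_number_of_messages: 50  # maximum concurrent SQS messages in flight (default is 5)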

I'm not using any preset, but my Elastic Agent output is configured to use these settings:

worker: 4
bulk_max_size: 1000
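
In Fleet these go in the output's advanced YAML configuration; in a standalone elastic-agent.yml they sit under the Elasticsearch output, roughly like this (a sketch; host and credentials are placeholders):

outputs:
  default:
    type: elasticsearch
    hosts: ["https://my-cluster.example.com:443"]  # placeholder
    api_key: "${ES_API_KEY}"                       # placeholder
    worker: 4                                      # parallel bulk workers
    bulk_max_size: 1000                            # events per bulk request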

Also, is this the blog post you mentioned?

If so, there is no mention of it being tested with Cloudtrail data; they mention this:

For the data in this article, we utilized S3 objects of mixed sizes, with JSON logs, with objects containing between 1 and 100K events.

I would expect a difference between processing S3 objects where each line is a single JSON log and processing Cloudtrail events, where the agent first needs to unnest the JSON logs from the Records object.

Thanks @leandrojmp, good to hear that higher throughput is possible, and yes, that is the blog post I was referring to.

I hadn't seen max_number_of_messages; I think I'll play with that setting a little bit. It does seem like that could be my issue, because my workers are pretty much sitting idle. Thanks.

The balanced preset, which is supposedly the default, is supposed to set bulk_max_size to 1600 and workers to 1; throughput is supposedly 1600 and 4. I'll mess with these if I don't have any luck with the max_number_of_messages setting.

In my opinion the default value of 5 is pretty low for Cloudtrail logs, especially if it is an organization trail. I was seeing some lag in ingestion and kept increasing this value until I got to real-time data without impacting the Agent.

One issue with Cloudtrail logs is that the agent needs to parse each message to split the Records field into multiple events, and this impacts performance as well.

My suggestion is to change the max_number_of_messages to see if you can increase the event rate, and if you reach any plateau after that, scale horizontally with another agent.

Yep, it is an organization trail with over 40 accounts, so there are lots of messages to sort through. Raising max_number_of_messages to 20 did the trick for me; I am now getting much better throughput of ~350 events/second, which is sufficient for my needs. I was actually able to scale my deployment down to a single instance.

I appreciate your assistance sir.