[Filebeat] Bug with multiple aws-cloudwatch logs using log_group_name_prefix

I have a lot of AWS Lambdas running and I'm trying to fetch their logs using Filebeat's aws-cloudwatch input type. I'm using start_position: beginning in order to backfill the logs when Filebeat starts.

When I specify a single log group everything works as expected, but when I use log_group_name_prefix: /aws/lambda, which matches many log groups, Filebeat does not try to fetch all of them from the beginning. It only fetches the first log group with startTime = 1970-01-01 00:00:00 +0000 UTC, endTime = 2022-11-23 16:28:17 +0000 UTC, and then fetches only the scan_frequency interval for the rest of the log groups.
What I expect is for the start_position configuration to be respected for each individual log group, but instead it's only applied to the first API call.

My configuration:

    filebeat.inputs:
       - type: aws-cloudwatch
         log_group_name_prefix: /aws/lambda
         region_name: eu-west-1
         scan_frequency: 15m
         start_position: beginning

Debug logs from the running Filebeat:

In the first one, for the first Lambda /aws/lambda/abn-verification-staging-abn-verification, you can see the startTime is 1970-01-01 00:00:00:

{"log.level":"debug","@timestamp":"2022-11-23T16:28:16.918Z","log.logger":"registrar","log.origin":{"file.name":"registrar/registrar.go","file.line":140},"message":"Starting Registrar","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-11-23T16:28:16.918Z","log.logger":"input.aws-cloudwatch","log.origin":{"file.name":"compat/compat.go","file.line":113},"message":"Input 'aws-cloudwatch' starting","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:28:17.102Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":167},"message":"start_position = beginning, startTime = 1970-01-01 00:00:00 +0000 UTC, endTime = 2022-11-23 16:28:17 +0000 UTC","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:28:17.102Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":159},"message":"sleeping for 1m0s before checking new logs","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-11-23T16:28:17.102Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":204},"message":"aws-cloudwatch input worker for log group: '/aws/lambda/abn-verification-staging-abn-verification' has started","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:28:17.319Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":87},"message":"sleeping for 200ms before making FilterLogEvents API call again","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:28:17.520Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":89},"message":"done sleeping","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:28:17.520Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":91},"message":"Processing #0 events","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-11-23T16:28:17.520Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":200},"message":"aws-cloudwatch input worker for log group '/aws/lambda/abn-verification-staging-abn-verification' has stopped.","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}

The second one, for the next Lambda /aws/lambda/aws-slack-notifications, has a startTime of 2022-11-23 16:28:17, while I expect to see 1970-01-01 00:00:00:

{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.102Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":161},"message":"done sleeping","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.103Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":167},"message":"start_position = beginning, startTime = 2022-11-23 16:28:17 +0000 UTC, endTime = 2022-11-23 16:29:17 +0000 UTC","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.103Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":159},"message":"sleeping for 1m0s before checking new logs","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-11-23T16:29:17.103Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":204},"message":"aws-cloudwatch input worker for log group: '/aws/lambda/aws-slack-notifications' has started","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.468Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":87},"message":"sleeping for 200ms before making FilterLogEvents API call again","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.668Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":89},"message":"done sleeping","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2022-11-23T16:29:17.668Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/cloudwatch.go","file.line":91},"message":"Processing #0 events","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-11-23T16:29:17.668Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":200},"message":"aws-cloudwatch input worker for log group '/aws/lambda/aws-slack-notifications' has stopped.","service.name":"filebeat","id":"648125F3E9E24916","ecs.version":"1.6.0"}

Hi @Kaiyan_Sheng, could you please have a look here?

Hello @EDzhelyov, thanks for reaching out! According to the AWS API documentation (FilterLogEvents - Amazon CloudWatch Logs), startTime can only go back as far as Jan 1, 1970 00:00:00 UTC.

Hi Kaiyan,

I'm unsure how the documentation you linked is related to my issue.

What I would expect to happen with start_position: beginning and multiple log groups matched by log_group_name_prefix is for Filebeat to fetch all records from the beginning for each matched log group; instead (at least in 8.4) that only happens for the first log group.

There is another issue we just found: the scan_frequency interval is applied between individual log groups. What I would expect with a log_group_name_prefix and a scan_frequency of 15m is for each log group to be queried at 15-minute intervals. Instead, what we see in the logs is that Filebeat makes only one query, for a single log group, every 15 minutes.

You can see that by looking at two consecutive logs from the Filebeat process:

{"log.level":"info","@timestamp":"2022-11-29T09:34:28.191Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":200},"message":"aws-cloudwatch input worker for log group '/aws/lambda/rds-slack-notifications' has stopped.","service.name":"filebeat","id":"652FC3D5076F3391","ecs.version":"1.6.0"}

{"log.level":"info","@timestamp":"2022-11-29T09:49:27.308Z","log.logger":"input.aws-cloudwatch.cloudwatch_poller","log.origin":{"file.name":"awscloudwatch/input.go","file.line":204},"message":"aws-cloudwatch input worker for log group: '/aws/lambda/record-unsubscribed-clients-production-record-unsubscribed' has started","service.name":"filebeat","id":"652FC3D5076F3391","ecs.version":"1.6.0"}

The worker for /aws/lambda/rds-slack-notifications stops at 09:34, and the next query, for the next log group, doesn't start until 09:49.

Ping @Kaiyan_Sheng. Can you check my comments above? I think there is a bug in the aws-cloudwatch input when using a log_group_name_prefix that basically makes the input unusable for more than one log group.

Hi @EDzhelyov, sorry for the late response. I think what you are looking for is number_of_workers. If you want more than one log group to be processed at the same time, this parameter needs to be set. The default value is 1.
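
For reference, a minimal sketch of how that could look with the configuration from the original post (the worker count of 10 is just an illustrative value, not a recommendation):

    filebeat.inputs:
      - type: aws-cloudwatch
        log_group_name_prefix: /aws/lambda
        region_name: eu-west-1
        scan_frequency: 15m
        start_position: beginning
        # Illustrative value: poll up to 10 log groups in parallel
        number_of_workers: 10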

@Kaiyan_Sheng thanks for the answer, I will test it.

Because I want to support a dynamic number of CloudWatch log groups and I don't have a mechanism to update the number_of_workers configuration every time we add a new log group, can I just set it to a big enough number, like 1,000, without impacting the processing power of Filebeat?

@EDzhelyov We added number_of_workers to help parallelize reading from multiple log groups. If number_of_workers is set to 10, then 10 workers will run at the same time, each reading from a different log group. If you have 100 log groups matching the prefix, each of those 10 workers will run roughly 10 times. The problem with specifying a big number like 1000 is that it will make Filebeat occupy a lot of resources.