Umbrella S3 logs in a weird format when using the CSV filter

Hello!

I'm currently pulling Cisco Umbrella logs from S3 buckets with Logstash and the s3 input, and I'm running into some weird behavior.

When I use only the s3 input and send the logs to Elastic, it works like a charm. But as soon as I add a csv filter to parse the logs, it looks like the charset is wrong and the filter fails to parse them.

Doing the parsing in another pipeline via pipeline-to-pipeline communication works, although for whatever reason I have to build the @timestamp field myself because the month comes out wrong.

Since I have a workaround this isn't urgent, but it isn't ideal either, as I would like to do the parsing in the same pipeline.
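
For reference, here is roughly what the pipeline-to-pipeline workaround looks like (the pipeline address and the split into two files below are illustrative, not my exact config). The first pipeline only runs the s3 input and forwards events; the second one receives them, runs the csv filter, and rebuilds @timestamp with a date filter:

# First pipeline: s3 input only, forward everything downstream
output {
    pipeline {
        send_to => ["umbrella_dns_parsing"]
    }
}

# Second pipeline: receive and parse
input {
    pipeline {
        address => "umbrella_dns_parsing"
    }
}
filter {
    # same csv / mutate filters as below, plus rebuilding @timestamp:
    date {
        match => ["event_creation_time", "yyyy-MM-dd HH:mm:ss"]
        target => "@timestamp"
    }
}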

Here is my S3 input:

input {
    s3 {
        access_key_id => "${s3_access_key_id}"
        id => "cisco_umbrella_aws_s3_bucket"
        bucket => "${s3_bucket}"
        region => "${s3_region}"
        secret_access_key => "${s3_secret_access_key}"
        prefix => "${s3_prefix}/dnslogs"
        add_field => {
            "[log_category]" => "network"
            "[log_subcategory]" => "dns"
            "[log_vendor]" => "cisco"
            "[log_product]" => "umbrella"
        }
    }
}

Here is the filter part of my pipeline for the DNS logs:

filter {
  csv {
    columns => [
      "[cisco][umbrella][_tmp][time]",
      "[cisco][umbrella][identity]",
      "[cisco][umbrella][identities]",
      "[source][address]",
      "[source][nat][ip]",
      "[cisco][umbrella][action]",
      "[dns][question][type]",
      "[dns][response_code]",
      "[dns][question][name]",
      "[cisco][umbrella][categories]",
      "[cisco][umbrella][policy_identity_type]",
      "[cisco][umbrella][identity_types]",
      "[cisco][umbrella][blocked_categories]"
    ]
    id => "cisco_umbrella_dns_parsing_csv"
  }
  mutate {
    rename => {
      "[cisco][umbrella][_tmp][time]" => "event_creation_time"
    }
    split => {
      "[cisco][umbrella][identities]" => ","
      "[cisco][umbrella][identity_types]" => ","
      "[cisco][umbrella][categories]" => ","
      "[cisco][umbrella][blocked_categories]" => ","
    }
    remove_field => ["[cisco][umbrella][_tmp]"]
    id => "cisco_umbrella_dns_mutate"
  }
  date {
    match => ["event_creation_time", "yyyy-MM-dd HH:mm:ss"]
    id => "cisco_umbrella_dns_date"
  }
  mutate {
    add_field => {
      "[event][action]" => "dns-request-%{[cisco][umbrella][action]}"
      "[observer][type]" => "dns"
    }
    id => "cisco_umbrella_ecs_compliance_mutate"
  }
}

To illustrate, here is a log in the correct format when using only the input:

"2024-10-09 07:19:57","John Doe (JohnDoe@mycompany.com)","John Doe (JohnDoe@mycompany.com),Default Site,DEVICENAME,SITE","10.10.10.10","1.1.1.1","Allowed","1 (A)","NXDOMAIN","mydomain.dns.domain.","","Group","Goup,Sites,and other,stuff",""

And here is part of a log as soon as I add the csv filter:

1tList2d (A)",0:00","JohnIllow Doe"Group,erIllow Lis"Group,erIl3es,and other stuff,stuff",""

It looks as if the charset was changed and the input can no longer delimit the logs correctly. Has anyone seen this behavior before?

One last piece of information: I'm running Logstash version 8.13.4.

Thanks in advance for your help!

Daniel