AWS vpcflow parser not loading/executing

Filebeat 8.4.3, attempting to consume AWS VPC flow logs from S3 and output them to Logstash.

S3 consumption is happening correctly, but the ingest pipeline never runs on the messages -- the output events simply contain the original message field with no parsing applied.

The aws.vpcflow module appears to load fine when reviewing the output of filebeat -e -d '*' -- no obvious errors.

Digging deeper, we've gone as far as watching Filebeat start up via strace -f to observe every file access it makes. We see it access the aws.vpcflow module's manifest.yml and config/input.yml, but the Filebeat process never reads the ingest/pipeline.yml file.

We're unsure what might cause this behavior, as the manifest.yml correctly references ingest/pipeline.yml in its ingest_pipeline value.

So we're at a loss as to what the issue is -- we'd appreciate any/all advice on what to look into next.

Our filebeat.yml input config:

filebeat.modules:
  - module: aws
    vpcflow:
      enabled: true
      var.queue_url: https://sqs.us-east-1.amazonaws.com/redacted/s3-vpcflow-queue
      var.role_arn: arn:aws:iam::redacted:role/svc-ap-aws-logs-s3-access
      var.access_key_id: 'redacted'
      var.secret_access_key: "redacted"

Did you run

filebeat setup -e

The AWS VPC flow ingest pipeline lives on the Elasticsearch side and is loaded by running setup.

You can check whether it's there with

GET _ingest/pipeline/filebeat-8.4.3-aws-vpcflow-pipeline

Also, are there any parsing error fields in the documents in Discover?

Can you show what the docs look like in Discover?

You can also use the ingest simulate API to test whether your messages / docs are being parsed, or use the test capabilities in Kibana under Ingest Pipelines.
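For example, a simulate call looks roughly like this (a sketch only -- the pipeline name should match whatever setup loaded into your cluster, and the sample record below is made up):

POST _ingest/pipeline/filebeat-8.4.3-aws-vpcflow-pipeline/_simulate
{
  "docs": [
    { "_source": { "message": "2 123456789012 eni-0abc123 10.0.0.5 10.0.0.9 443 49152 6 10 840 1664400000 1664400060 ACCEPT OK" } }
  ]
}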

Hi Stephen...thanks for the reply!

That is interesting, as Filebeat is responsible for parsing every other data source we use it for (CrowdStrike Falcon, GCP, AWS CloudTrail, Cloudflare, Auth0, Okta, Palo Alto, etc.) -- I'm struggling to understand why it wouldn't be the same for AWS VPC flow, considering the ingest pipeline definition is clearly right there in the package, just like with all the other modules. Am I misunderstanding what you mean by "the ingest pipeline is on the Elasticsearch side"?

Our pipeline is maybe somewhat unique in that we flow events from Filebeat -> Logstash -> Kafka -> Elastic Cloud via the Kafka Connect Elasticsearch Sink connector, so it's rather disconnected from the native Elastic Cloud integration baked into Filebeat/Logstash, but that's never been an issue before -- we have use cases for consuming parsed events from the Kafka topic (think event aggregation/enrichment with KSQL, custom Spark Streaming jobs, etc.).

Anyhow, we were caught off guard to find this behavior when the other 10+ Filebeat input modules we're using Just Worked(tm) with a simple source config, giving us a well-formed stream of parsed events on the way out the door to Logstash and beyond.

Is there some way to get Filebeat to act the same as it does with all the other input modules? It seems like all the pieces are in place. We've got a fairly well-defined deployment process (lightly customized Docker containers built off your official images re: config placement etc.) and this feels like a significant departure from the way everything else has worked out, which has been great thus far.

Appreciate the guidance as new Elastic Cloud customers looking to finally deprecate our Splunk environment in favor of you :slight_smile:
-eric

Hi @erhank

There's a lot to parse and understand here in a Discuss forum thread.

You could absolutely reach out to our sales team and have a solutions architect work with you... that's actually my job in the field; this forum is all volunteer work.

Yes you are :slight_smile:

Unfortunately, not every module does the parsing in the same place... shame on us... and to make it worse, sometimes part of the parsing happens in the Beat and the rest in Elasticsearch.

If you look in the module package, what's under config/ is what runs in the Beat, and what's in ingest/pipeline.yml is what becomes the ingest pipeline in Elasticsearch.
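Roughly, the fileset layout looks like this (a simplified sketch -- exact manifest contents vary by module and version):

module/aws/vpcflow/
  manifest.yml         # ties the pieces together, e.g. ingest_pipeline: ingest/pipeline.yml
  config/input.yml     # the input template Filebeat renders and runs locally (the file your strace sees it read)
  ingest/pipeline.yml  # the pipeline definition that `filebeat setup` pushes into Elasticsearch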

I checked -- the VPC flow logs are definitely parsed in the ingest pipeline that gets loaded into Elasticsearch during the filebeat setup phase... in fact, this is the most common place logs are parsed; only a subset are parsed on the Filebeat side.

If Filebeat were directly connected to Elasticsearch, Filebeat would send a pipeline parameter indicating which ingest pipeline to execute when the documents are written into Elasticsearch.
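Conceptually, that parameter rides on the bulk request, something like this (a simplified sketch only -- the index and pipeline names are placeholders):

POST _bulk
{ "index": { "_index": "filebeat-8.4.3", "pipeline": "filebeat-8.4.3-aws-vpcflow-pipeline" } }
{ "message": "2 123456789012 eni-0abc123 10.0.0.5 10.0.0.9 443 49152 6 10 840 1664400000 1664400060 ACCEPT OK" }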

Yes, that's exactly the file the ingest pipeline gets loaded into Elasticsearch from during the setup phase.

Interesting -- I'm pretty sure some of those are also parsed on the Elasticsearch side. Have you set up index templates that include a default pipeline? ... Yup, Okta and PANW are also parsed in ingest pipelines in Elasticsearch, not on the Filebeat side.

These are the pipelines loaded into Elasticsearch that are doing the parsing.
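If a default pipeline is in play, it would be attached as an index (or index template) setting, roughly like this (a sketch only -- the index and pipeline names are placeholders):

PUT filebeat-vpcflow/_settings
{
  "index.default_pipeline": "filebeat-8.4.3-aws-vpcflow-pipeline"
}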

I think the Okta module does some initial parsing in Beats and then an ingest pipeline is applied, so you may be getting only partial parsing...

So you have something working... I am not exactly sure how / how much / what -- there are a couple of possible methods.

Your architecture is not that unusual and is perfectly valid. I'm glad it's working for you, and I am not suggesting you change it.

However, often it's slightly different...

With Filebeat > Kafka > Logstash > Elasticsearch

With that architecture, we can make sure the correct ingest pipeline gets called from Logstash: the pipeline name is carried in the event metadata, and the Logstash Elasticsearch output can reference it when indexing the documents.
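That usually looks something like this in the Logstash pipeline (a sketch of the documented pattern for Filebeat modules behind Logstash -- the host is a placeholder):

output {
  if [@metadata][pipeline] {
    elasticsearch {
      hosts => ["https://my-deployment.es.example.com:9243"]
      index => "%{[@metadata][beat]}-%{[@metadata][version]}"
      # pass along the pipeline name Filebeat put in @metadata
      pipeline => "%{[@metadata][pipeline]}"
      manage_template => false
    }
  } else {
    elasticsearch {
      hosts => ["https://my-deployment.es.example.com:9243"]
      index => "%{[@metadata][beat]}-%{[@metadata][version]}"
      manage_template => false
    }
  }
}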

But somehow I think you are getting the pipelines called... I need to check... it could have to do with the new data streams. My understanding is that Kafka Connect does not forward / honor the pipeline metadata... I could be wrong.

Did you set up separate index templates etc. for the different log types, or are you writing them all to just the filebeat index?

Did you set this all up yourself?

One thing you can do is stand up a Filebeat directly connected to Elastic Cloud, enable the AWS module, and run filebeat setup to load the pipeline. This is a common approach to loading the assets when you have a complex ingest architecture... you still need to load the assets into Elasticsearch somehow; with the new Agent, that happens on the Elasticsearch side, not from the Beats side (but conversion to Agent is another topic).
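A minimal sketch of that approach (the Cloud ID / credentials are placeholders, and it's worth double-checking the flags against your Filebeat version):

filebeat modules enable aws
filebeat setup --pipelines --modules aws \
  -E cloud.id="my-deployment:redacted" \
  -E cloud.auth="elastic:redacted"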

I will check with some folks on how / whether the ingest pipelines might be getting called.

I can do a little testing as well.

Can you post a couple sanitized VPC flow logs or PANW logs so I can do a little testing?

We love to hear that... I am traveling, so it may take a day or two...

Agreed! Will DM you and maybe we can take this chat direct -- I pinged our account rep to track down your corp email but haven't heard back yet.

FYI, all of those other data sources are being processed exclusively within Filebeat as far as we can tell. Our entire development process has been to fire it up from the command line with output.console and validate that the format looks good, then redirect the output to Logstash and go -- Elastic has always been the last stop on the line once we're convinced the final event format is what we want. We're comfortable with that and are hopeful we can figure out why aws.vpcflow isn't doing it.

The most confusing part to me is how this one input seemingly goes down a completely different execution path than all the others. It makes me want to start downgrading Filebeat, but all the other modules we've built are on pretty recent versions as well; this just happened to be the most recent ingest module we've built, and we were pulling current/latest.

Appreciate the support and will contact you shortly.
