S3 input issues with images

I've been trying to solve this: I have an Amazon S3 bucket with a lot of files of many types, and I only care about the XML files. I can't figure out how to make the S3 input process only those. It keeps trying to process every file, and as a result the images keep causing charset errors.

Is there a way to deal with this?

Can't the exclude_pattern or prefix options help?

prefix can't, since the only pattern is the .xml suffix. I tried using exclude_pattern, but looking at the logs I didn't see anything come out on stdout. Am I right in thinking exclude_pattern applies to the filename? The docs say "key" and I wasn't sure whether that was the same thing.

exclude_pattern is matched against the filename.

Here is what I'm using. Am I correct in thinking this should go through every file in S3 and process any file whose filename ends in .xml?

input {
  s3 {
    # credentials omitted
    exclude_pattern => "^((?!xml$).)*$"
  }
}

#filter {
#  xml {
#    source => "message"
#    store_xml => false
#  }
#}

output {
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
}

No, that expression doesn't look right. Wouldn't (?!\.xml)$ stand a better chance? But negative regexp assertions aren't my forte.

I'm thinking the issue might be the regex is wrong as well. I tried yours and it isn't processing anything either.
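A quick check in plain Ruby may explain why nothing was processed with that suggestion. This is only a sketch: it assumes the plugin compiles exclude_pattern as a Ruby regex and searches for it anywhere in the object key. The `excluded?` helper and the sample keys are made up for illustration and are not part of the plugin.

```ruby
# Hypothetical helper, not the plugin's API: assumes exclude_pattern is
# compiled as a Ruby regex and searched for anywhere in the object key.
def excluded?(key, pattern)
  !Regexp.new(pattern).match(key).nil?
end

# Made-up sample keys for illustration.
SAMPLE_KEYS = ["reports/data.xml", "images/photo.jpg", "notes.txt"]

# (?!\.xml)$ : at the very end of the key nothing follows, so the negative
# lookahead always succeeds and the pattern matches every key.
LOOKAHEAD_AT_END = '(?!\.xml)$'

SAMPLE_KEYS.each do |key|
  puts "#{key}: excluded=#{excluded?(key, LOOKAHEAD_AT_END)}"
end
```

Under that assumption every key is excluded, the XML files included, which would be consistent with seeing no output at all.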

Isn't matching file types / extensions something that most people would need to do with Logstash? I feel like it would be very useful to have a built in way to say "only process files with this or that type" on the level of a codec or something.

It's not an unreasonable request; feel free to file a GitHub issue.

\.(?!xml) (without the anchor) matches the way you would expect, even when the .xml is non-terminal. Which, for me, is unexpected. :)
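To see what \.(?!xml) does and does not exclude, here is a small Ruby sketch. It assumes the pattern is applied as an unanchored Ruby regex against the whole object key. The `kept_keys` helper and the keys are hypothetical.

```ruby
# Hypothetical helper: returns the keys that would NOT be excluded, assuming
# the pattern is searched for anywhere in the key.
def kept_keys(keys, pattern)
  regex = Regexp.new(pattern)
  keys.reject { |key| regex.match(key) }
end

# Made-up keys covering a few edge cases.
DOT_KEYS = [
  "data.xml",        # only dot is followed by "xml"      -> kept
  "photo.jpg",       # dot followed by "jpg"              -> excluded
  "archive.xml.bak", # second dot is followed by "bak"    -> excluded
  "v1.2/data.xml",   # dot in a directory name            -> excluded
  "README"           # no dot at all, so nothing matches  -> kept
]

# A literal dot that is not immediately followed by "xml".
DOT_NOT_XML = '\.(?!xml)'

p kept_keys(DOT_KEYS, DOT_NOT_XML)
```

Two caveats show up in this sketch: a dot anywhere in the path (such as a versioned directory name) causes exclusion, and keys without any dot are never excluded.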

@Badger Thanks! Is this site not working correctly though? http://rubular.com/r/yP89qfpIBB

I tested it out there and it doesn't seem to be working.

Me too. https://regexr.com/ worked for me.

Great, I will try this out. Thanks!

Looks like the site you used uses JavaScript regex, but the S3 input plugin requires Ruby regex. So complicated, and still not working. It's odd that I can't find anyone else who has had this issue.

This works as expected for me when I check on Rubular: ^((?!XML$).)*$

However, @magnusbaeck I am running Logstash with this:

exclude_pattern => "^((?!XML$).)*$"

and it shows nothing in the logs when I have this set as the output:

output {
  stdout {
    codec => rubydebug {
      metadata => true
    }
  }
}

I've confirmed that there are .XML files in the bucket. It doesn't matter that they're nested several directories in, does it?

Also, I am somewhat confused about how the exclude_pattern regex is supposed to work. For example, if I have a regex that matches part of the filename but not the entire thing, will that file be excluded? Does it have to match the entire filename to take effect?

I don't think there are any implicit anchors, i.e. if the given expression matches anywhere in the filename string, that file is excluded. So a partial match, if you will.
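The partial-match behaviour described here can be tried out in plain Ruby. This is a sketch under the assumption that the plugin does an unanchored match of the pattern against the full object key, nested directories included. The `excluded_key?` helper and the keys are hypothetical.

```ruby
# Hypothetical helper: true when the pattern matches anywhere in the key
# (no implicit anchors), which is the assumed exclusion rule.
def excluded_key?(key, pattern)
  !(key =~ Regexp.new(pattern)).nil?
end

# Made-up keys, some nested several directories deep.
NESTED_KEYS = ["logs/2018/data.xml", "logs/2018/photo.jpg", "logs/xml-notes.txt"]

# A partial match is enough: only the key containing ".jpg" is excluded.
PARTIAL = '\.jpg'
p NESTED_KEYS.select { |key| excluded_key?(key, PARTIAL) }

# The pattern from the original post matches every key that does not end
# in "xml", so only keys ending in "xml" survive.
NOT_ENDING_IN_XML = '^((?!xml$).)*$'
p NESTED_KEYS.reject { |key| excluded_key?(key, NOT_ENDING_IN_XML) }
```

Note that the match is case-sensitive, so a pattern written with xml does not treat keys ending in .XML as XML files, and vice versa.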
