Grok S3 object "path"

Hi all,

I'm trying to figure out a way to grok out the key of an S3 object. Here is an example key:

elasticmapreduce/j-fjtnfnfk56/containers/application_1111111111111_0169/container_1111111111111_0169_01_000029/stderr.gz

I used https://grokdebug.herokuapp.com/ and came up with the following:

%{WORD:folder}/(?<cluster>[^/]*)/(?<subfolder_name>[^/]*)/(?<application_id>[^/]*)/(?<container_id>[^/]*)/(?<file_name>[^$]*)

But when I put it in my .conf file, as follows, it doesn't parse out the key into the fields:

filter {
    grok {
        add_field => [ "file", "%{[@metadata][s3][key]}" ]
        match => { "file" => "%{WORD:folder}\/(?<cluster_id>[^/]*)\/(?<subfolder_name>[^/]*)\/(?<application_id>[^/]*)\/(?<container_id>[^/]*)\/(?<file_name>[^$]*)" }
    }
}

Even though the grok appears to work via https://grokdebug.herokuapp.com/:

{
    "folder": [ [ "elasticmapreduce" ] ],
    "cluster": [ [ "j-fjtnfnfk56" ] ],
    "subfolder_name": [ [ "containers" ] ],
    "application_id": [ [ "application_1111111111111_0169" ] ],
    "container_id": [ [ "container_1111111111111_0169_01_000029" ] ],
    "file_name": [ [ "stderr.gz\n" ] ]
}

Also, how do I get rid of the \n at the end of the file name?

Any help would be greatly appreciated.

add_field only gets executed when the grok successfully completes, so the file field will not exist when grok tries to match it. Matching a non-existent field is a no-op but counts as a successful completion.
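
One way to act on this, as a minimal sketch: populate the field in a separate mutate filter ahead of grok, so that it already exists when grok tries to match it. The field names come from the question. Ending the pattern with %{GREEDYDATA:file_name} instead of [^$]* is a substitution of mine, on the assumption that the trailing newline was only captured because [^$] matches every character except a literal dollar sign.

```
filter {
    # Create the field first, in its own filter, so grok has something to match.
    mutate { add_field => { "file" => "%{[@metadata][s3][key]}" } }

    grok {
        # The forward slashes do not need escaping inside the quoted pattern.
        # GREEDYDATA is ".*", which stops at a newline (assumed fix for the "\n").
        match => { "file" => "%{WORD:folder}/(?<cluster_id>[^/]*)/(?<subfolder_name>[^/]*)/(?<application_id>[^/]*)/(?<container_id>[^/]*)/%{GREEDYDATA:file_name}" }
    }
}
```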

Use a literal newline in mutate to remove a newline.

mutate { gsub => [ "file_name", "
", "" ] }
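
An alternative that avoids the multi-line string, offered as a sketch rather than a tested fix: the mutate filter's strip option trims leading and trailing whitespace, which should include a trailing newline. This assumes the field is named file_name, as in the grok pattern from the question.

```
filter {
    # strip removes whitespace (newlines included) from both ends of the field
    mutate { strip => [ "file_name" ] }
}
```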

Thank you for that suggestion. However, it only seems to have split the file field into a comma-separated line now. I tried adding the add_field option, but that didn't seem to do anything :(

filter {
    grok {
        match => { "message" => "%{DATESTAMP:message_timestamp} %{LOGLEVEL:severity} %{GREEDYDATA:msg}" }
        add_field => [ "file", "%{[@metadata][s3][key]}" ]
    }
    mutate {
        copy => { "file" => "file_tmp" }
        split => [ "file_tmp" , "/" ]
        add_field => {
            "folder" => "%{file_tmp[0]}"
            "cluster" => "%{file_tmp[1]}"
            "subfolder_name1" => "%{file_tmp[2]}"
            "containers" => "%{file_tmp[3]}"
            "application_id" => "%{file_tmp[4]}"
            "container_id" => "%{file_tmp[5]}"
            "filename" => "%{file_tmp[6]}"
        }
    }
}

Also, using the code above, I got A BUNCH of these:
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,936][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,937][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,937][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}
[2019-07-14T20:54:07,937][WARN ][logstash.filters.mutate ] Exception caught while applying mutate filter {:exception=>"Invalid FieldReference:file[0]"}

Also, how do you do a multi-line code block?

That should be [file][0] etc.

Thank you, the fields show up now, but the values of the fields are that value now :-/ (getting closer)
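
If the field values are now the literal reference text, such as %{[file_tmp][0]}, one possible cause (an assumption on my part, not confirmed in this thread) is the fixed order in which a single mutate filter applies its options: split runs before copy, and add_field is applied at the end. With copy and split in the same mutate, the split would find nothing to split and file_tmp would stay a plain string. A sketch that sidesteps the ordering by giving each step its own mutate, using the six segments of the example key:

```
filter {
    # Each mutate runs to completion before the next one starts,
    # so the steps happen in the order written.
    mutate { add_field => { "file" => "%{[@metadata][s3][key]}" } }
    mutate { copy => { "file" => "file_tmp" } }
    mutate { split => { "file_tmp" => "/" } }
    mutate {
        add_field => {
            "folder"         => "%{[file_tmp][0]}"
            "cluster"        => "%{[file_tmp][1]}"
            "subfolder_name" => "%{[file_tmp][2]}"
            "application_id" => "%{[file_tmp][3]}"
            "container_id"   => "%{[file_tmp][4]}"
            "filename"       => "%{[file_tmp][5]}"
        }
    }
}
```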

It's worth asking: I saw the dissect filter. Would that be a better solution in this case?

The dissect filter is often a better fit than grok for predictably delimited events.

input { generator { count => 1 lines => [ '' ] } }
filter {
    mutate { add_field => { "path" => "elasticmapreduce/j-fjtnfnfk56/containers/application_1111111111111_0169/container_1111111111111_0169_01_000029/stderr.gz" } }
    dissect { mapping => { "path" => "%{folder}/%{cluster}/%{subfolder_name}/%{application_id}/%{container_id}/%{file_name}" } }
}
output { stdout { codec => rubydebug { metadata => true } } }

will generate an event with these fields

"subfolder_name" => "containers",
       "cluster" => "j-fjtnfnfk56",
          "path" => "elasticmapreduce/j-fjtnfnfk56/containers/application_1111111111111_0169/container_1111111111111_0169_01_000029/stderr.gz",
        "folder" => "elasticmapreduce",
     "file_name" => "stderr.gz",
"application_id" => "application_1111111111111_0169",
  "container_id" => "container_1111111111111_0169_01_000029"

However, it does not handle optional fields, or any unpredictability except padding on separators. That's why it is so fast.
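
Because of that, it may be worth checking for keys that do not fit the mapping. The sketch below assumes the S3 key is available in [@metadata][s3][key], as in the question, and relies on the _dissectfailure tag that dissect adds by default when a mapping cannot be applied. The unexpected_s3_key_layout tag is a made-up name for illustration. Note that a key with extra segments would not fail: the remainder, slashes included, would end up in file_name.

```
filter {
    mutate { add_field => { "path" => "%{[@metadata][s3][key]}" } }
    dissect {
        mapping => { "path" => "%{folder}/%{cluster}/%{subfolder_name}/%{application_id}/%{container_id}/%{file_name}" }
    }
    # Keys missing one of the expected separators are tagged by dissect.
    if "_dissectfailure" in [tags] {
        mutate { add_tag => [ "unexpected_s3_key_layout" ] }   # hypothetical tag name
    }
}
```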

Thank you for that breakdown. For now, I'm focusing on this absolute parent path, but I would like to modify it to read in all of the other parent directories, with an unknown number of sub-directories. Is this possible with either dissect or split?

If the path has a variable number of / separators but always starts with a fixed number of segments, you could dissect that prefix and then use split on whatever is left. For example,

    dissect { mapping => { "message" => "/%{field1}/%{field2}/%{field3}/%{restOfLine}" } }
    mutate { split => { "restOfLine" => "/" } }

would handle both "/a/b/c/1/2" and "/d/e/f/7/8/9".
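
To try this out in isolation, here is a small test pipeline in the same style as the dissect example earlier in the thread. It feeds both sample paths through the generator input. The firstRemaining field is a hypothetical name, added only to show how one element of the split array can be referenced afterwards.

```
input { generator { count => 1 lines => [ "/a/b/c/1/2", "/d/e/f/7/8/9" ] } }
filter {
    # Fixed three-segment prefix goes into named fields, the rest into restOfLine.
    dissect { mapping => { "message" => "/%{field1}/%{field2}/%{field3}/%{restOfLine}" } }
    # Turn the remainder into an array, one element per path segment.
    mutate { split => { "restOfLine" => "/" } }
    # A separate mutate, so the split has already happened when this runs.
    mutate { add_field => { "firstRemaining" => "%{[restOfLine][0]}" } }
}
output { stdout { codec => rubydebug } }
```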

I'll give that a shot, thanks!
